A SMOOTH PRIMAL-DUAL OPTIMIZATION FRAMEWORK FOR
NONSMOOTH COMPOSITE CONVEX MINIMIZATION 

QUOC TRAN-DINH*, OLIVIER FERCOQ†, AND VOLKAN CEVHER*


Abstract. We propose a new first-order primal-dual optimization framework for a convex optimization
template with broad applications. Our optimization algorithms feature optimal convergence guarantees
under a variety of common structure assumptions on the problem template. Our analysis relies on a novel
combination of three classic ideas applied to the primal-dual gap function: smoothing, acceleration, and
homotopy. The algorithms due to the new approach achieve the best known convergence rate results, in
particular when the template consists of only nonsmooth functions. We also outline a restart strategy for
the acceleration to significantly enhance the practical performance. We demonstrate relations with the
augmented Lagrangian method and show how to exploit strongly convex objectives with rigorous
convergence rate guarantees. We provide numerical evidence with two examples and illustrate that the new
methods can outperform the state-of-the-art, including the Chambolle-Pock and the alternating direction
method of multipliers algorithms.

Keywords: Gap reduction technique; first-order primal-dual methods; augmented Lagrangian; smoothing
techniques; homotopy; separable convex minimization; parallel and distributed computation.

AMS subject classifications. 90C25, 90C06, 90-08 

1. Introduction. We propose a new analysis framework for designing primal-dual optimization
algorithms to obtain numerical solutions to the following convex optimization template
described in the primal space:

$$P^* := \min_{x \in \mathbb{R}^n} \{ P(x) := f(x) + g(Ax) \}, \tag{1.1}$$

where $f : \mathbb{R}^n \to \mathbb{R} \cup \{+\infty\}$ and $g : \mathbb{R}^m \to \mathbb{R} \cup \{+\infty\}$ are proper, closed and convex functions;
$\mathcal{X} := \mathrm{dom}(P)$ is the domain of $P$, and $A \in \mathbb{R}^{m \times n}$ is given. For generality, we do not impose any
smoothness assumption on $f$ and $g$. In particular, we refer to (1.1) as a nonsmooth composite
minimization problem.


Associated with the primal problem (1.1), we define the following dual formulation:

$$D^* := \max_{y \in \mathbb{R}^m} \{ D(y) := -f^*(-A^\top y) - g^*(y) \}, \tag{1.2}$$

where $f^*$ and $g^*$ are the Fenchel conjugates of $f$ and $g$, respectively, and $\mathcal{Y} := \mathrm{dom}(D)$ is the
domain of $D$. Clearly, (1.2) has exactly the same form as (1.1) in the dual space.


The templates (1.1)-(1.2) provide a unified formulation for a broad set of applications in
various disciplines, see, e.g., [57]. While problem (1.1) is presented in the unconstrained form,
it automatically covers constrained settings by means of indicator functions.
For example, (1.1) covers the following prototypical optimization template via $g(z) := \delta_{\{c\}}(z)$
(i.e., the indicator function of the convex set $\{c\}$):

$$f^* := \min_{x \in \mathbb{R}^n} \{ f(x) + \delta_{\{c\}}(Ax) \} = \min_{x \in \mathbb{R}^n} \{ f(x) : Ax = c \}, \tag{1.3}$$

where $f$ is a proper, closed and convex function as in (1.1). Note that (1.3) is sufficiently
general to cover standard convex optimization subclasses, such as conic programming, monotropic
programming, and geometric programming, as specific instances.

Among classical convex optimization methods, the primal-dual approach is perhaps one of
the best candidates to solve the primal-dual pair (1.1)-(1.2). Theory and methods along this
approach have been developed for several decades and have led to various algorithms. A more
thorough comparison between existing primal-dual methods and our approach in this paper is
postponed to Section 7. There are several reasons for our emphasis on first-order primal-dual
methods for (1.1)-(1.2), with the most obvious one being their scalability. Coupled with recent
demand for low-to-medium accuracy solutions in applications, these methods indeed provide
important trade-offs between the complexity-per-iteration and the iteration-convergence rate,
along with the ability to distribute and decentralize the computation.

Unfortunately, the newfound popularity of primal-dual optimization has led to an explosion
in the number of different algorithmic variants, each of which requires a different set of assumptions
on problem settings or methods, such as strong convexity, error bound conditions, metric regularity,
Lipschitz gradient, Kurdyka-Lojasiewicz conditions or penalty parameter tuning.
As a result, the optimal choice of the algorithm for a given application is often unclear as it
is not guided by theoretical principles, but rather by trial-and-error procedures, which can incur
unpredictable computational costs. A vast list of references can be found, e.g., in [13, 50].

To this end, we address the following key question: “Can we construct heuristic-free, accelerated
first-order primal-dual methods for nonsmooth composite minimization that have the best
convergence rate guarantees?” To our knowledge, this question has never been addressed fully
in a unified fashion in this generality. The theory presented in this paper remains
applicable to the smooth cases of $f$ without requiring either a Lipschitz gradient or a strong
convexity-type assumption. Such a model covers several important applications such as graphical
learning models and Poisson imaging reconstruction [55].

1.1. Our approach. Associated with the primal problem (1.1) and the dual one (1.2), we
define

$$G(w) := P(x) - D(y), \tag{1.4}$$

as a primal-dual gap function, where $w := (x, y)$ is the concatenated primal-dual variable. The gap
function $G$ in (1.4) is convex in $w$. Under strong duality, we have $G(w^*) = 0$ if and only if
$w^* = (x^*, y^*)$ is a primal-dual solution of (1.1) and (1.2).
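To make the gap function concrete, here is a minimal sketch on a toy instance that is not taken from the paper: $f(x) = \frac{1}{2}\|x\|_2^2$, $g = \delta_{\{c\}}$ and $A = [1\ 1]$, for which $f^*(v) = \frac{1}{2}\|v\|_2^2$ and $g^*(y) = cy$. The function names are illustrative assumptions.

```python
import math


def primal_value(x, c, tol=1e-12):
    """P(x) = 0.5*||x||^2 + indicator_{c}(Ax) for the toy matrix A = [1, 1]."""
    if abs(x[0] + x[1] - c) > tol:
        return math.inf  # the indicator function is +infinity off the constraint
    return 0.5 * (x[0] ** 2 + x[1] ** 2)


def dual_value(y, c):
    """D(y) = -f*(-A^T y) - g*(y) = -0.5*||A^T y||^2 - c*y, with ||A^T y||^2 = 2*y^2."""
    return -0.5 * (2.0 * y * y) - c * y


def gap(x, y, c):
    """Primal-dual gap G(w) = P(x) - D(y) for w = (x, y)."""
    return primal_value(x, c) - dual_value(y, c)
```

For $c = 1$, the pair $x^* = (0.5, 0.5)$, $y^* = -0.5$ satisfies $-A^\top y^* = x^*$ and $Ax^* = c$, so the gap vanishes there, while any other feasible $x$ paired with any $y$ gives a nonnegative gap by weak duality.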


The gap function (1.4) is widely used in convex optimization and variational inequalities.
Several researchers have already used the gap function as a tool to characterize the
convergence of optimization algorithms, e.g., within a variational inequality framework [13, 29, 48].

In stark contrast with the existing literature, our analysis relies on a novel combination of 
three ideas applied to the primal-dual gap function: smoothing, acceleration, and homotopy. 
While some combinations of these techniques have already been studied in the literature, their 
full combination is important and has not been studied yet. 

Smoothing: We can obtain a smoothed estimate of the gap function within Nesterov’s smoothing
technique applied to $f$ and $g$. In the sequel, we denote by
$G_{\gamma\beta}(w) := P_\beta(x) - D_\gamma(y)$ the smoothed gap function approximating $G(w)$, where $P_\beta$ is a
smoothed approximation to $P$ depending on the smoothness parameter $\beta > 0$, and $D_\gamma$ is a smoothed
approximation to $D$ depending on the smoothness parameter $\gamma > 0$. By smoothed approximation, we
mean the same max-form approximation as in [44]. However, it is still unclear how to properly choose
these smoothness parameters.

Acceleration: Using an accelerated scheme, we will design new primal-dual decomposition
methods that satisfy the following smoothed gap reduction model:

$$G_{\gamma_{k+1}\beta_{k+1}}(\bar{w}^{k+1}) \le (1 - \tau_k)\, G_{\gamma_k\beta_k}(\bar{w}^k) + \psi_k, \tag{1.5}$$

where $\{\bar{w}^k\}$ and the parameters are generated by the algorithms with $\tau_k \in [0, 1)$, and $\{\max\{\psi_k, 0\}\}$
converges to zero. Similar ideas have been proposed before; for instance, Nesterov’s excessive gap
technique is a special case of the gap reduction model (1.5) when $\psi_k \le 0$.


Homotopy: We will design algorithms to maintain (1.5) while simultaneously driving $\beta_k$,
$\gamma_k$ and $\tau_k$ to zero to achieve the optimal convergence rate based on the assumptions on the
problem template. This strategy will also allow our theoretical guarantees not to depend on the
diameter of the feasible sets. A similar technique is also proposed in [33], but only for symmetric
primal-dual methods. It is also used in conjunction with Nesterov’s smoothing technique in [8]
for unconstrained problems, but with only an $O(\ln(k)/k)$ convergence rate.

Note that without homotopy, we can directly apply Nesterov’s accelerated methods to minimize
the smoothed gap function $G_{\gamma\beta}$ for given $\gamma > 0$ and $\beta > 0$. In this case, these smoothness
parameters must be fixed a priori depending on the desired accuracy and the prox-diameter of
both the primal and dual problems, which may not be applicable to (1.1) or (1.3) due to the
unboundedness of the dual feasible domain.

1.2. Our contributions. The main contributions of this paper consist of three parts:

(a) (Theory) We propose to use differentiable smoothing prox-functions to smooth both primal
and dual objective functions, which allows us to update the smoothness parameters in
a heuristic-free manner. We introduce a new model-based gap reduction condition for
constructing optimal first-order primal-dual methods that can operate in a black-box
fashion (in the sense of [44]). Our analysis technique unifies several classical concepts in
convex optimization, from Auslander’s gap function and Nesterov’s smoothing technique
to the accelerated proximal gradient descent method, in a nontrivial manner. We also
prove a fundamental bound on the primal objective residual and the feasibility violation
for (1.3), which leads to the main results of our convergence guarantees.

(b) (Algorithms and convergence theory) We propose two novel primal-dual first-order
algorithms for solving (1.1) and (1.3). The first algorithm requires only one primal
step and one dual step per iteration, without any primal averaging scheme. The second
algorithm needs one primal step and two dual steps, and uses a weighted averaging scheme on
the primal iterates. We prove the $O(1/k)$ convergence rate on the objective residual
$P(\bar{x}^k) - P^*$ of (1.1) for both algorithms, which is the best known in the literature for the
fully nonsmooth setting. For the constrained case (1.3), we also prove the convergence of
both algorithms in terms of the primal objective residual and the feasibility violation;
both achieve the $O(1/k)$ convergence rate and are independent of the prox-diameters,
in contrast to existing smoothing techniques.

(c) (Special cases) We illustrate that the new techniques enable us to exploit additional
structures, including the augmented Lagrangian smoothing, and the strong convexity
of the objectives. We show the flexibility of our framework by applying it to different
constrained settings including conic programs.

Let us emphasize some key aspects of this work in detail. First, our characterization is
radically different from existing results such as in [4, 13, 22, 28, 29, 48, 50] thanks to the separation
of the convergence rates for primal optimality and the feasibility. We believe this is important
since the separate constraint feasibility guarantee can act as a consensus rate in distributed
optimization. Second, our assumptions cover a much broader class of problems: we can trade off
primal optimality and constraint feasibility without any heuristic strategy, and our convergence
rates seem to be the best known rates for the class of fully nonsmooth problems (1.1) so far. Third,
our augmented Lagrangian algorithm generates the primal and dual sequences simultaneously,
in contrast to existing augmented Lagrangian algorithms, while it maintains its $O(1/k^2)$-optimal
convergence rate on both the objective residual and the feasibility gap. Fourth, we also
describe how to adapt known structures on the objective and constraint components, such as 
strong convexity. Fifth, this work significantly expands on our earlier conference work [52] not 
only with new methods but also by demonstrating the impact of warm-start and restart. Finally, 
our follow up work [54] also demonstrates how our analysis framework and uncertainty principles 
extend to cover alternating direction optimization methods. 

1.3. Paper organization. In Section 2, we propose a smoothing technique with proximity
functions for (1.1)-(1.2) to estimate the primal-dual gap. We also investigate the properties of the
smoothed gap function and introduce the model-based gap reduction condition. Section 3 presents
the first primal-dual algorithmic framework using accelerated (proximal-) gradient schemes for
solving (1.1)-(1.2) and its convergence theory. Section 4 provides the second primal-dual
algorithmic framework using averaging sequences for solving (1.1)-(1.2) and its convergence theory.
Section 5 specifies different instances of our algorithmic framework for (1.1)-(1.2) under other
common optimization structures and generalizes it to the cone constraint $Ax - c \in \mathcal{K}$. A
comparison between our approach and existing methods is given in Section 7. For clarity of exposition,
technical proofs of the results in the main text are moved to the appendix.


2. Smoothed gap function and optimality characterization. We propose to smooth
the primal-dual gap function defined by (1.4) by proximity functions. Then, we provide a key
lemma to characterize the optimality condition for (1.1) and (1.2).


2.1. Basic notation. We use $\|x\|_2$ for the Euclidean norm. Given a matrix $S$, we define a
semi-norm of $x$ as $\|x\|_S := \sqrt{\langle Sx, Sx\rangle}$. When $S$ is the identity matrix $\mathbb{I}$, we recover the standard
Euclidean norm. When $S^\top S$ is positive definite, the semi-norm becomes a weighted norm. Its
dual norm exists and is defined by $\|u\|_{S,*} := \max\{\langle u, v\rangle : \|v\|_S = 1\}$. When $S^\top S$ is not positive
definite, we still consider the quantity $\|u\|_{S,*} := \max\{\langle u, v\rangle : \|v\|_S = 1\}$, although $\|u\|_{S,*}$ is finite
if and only if $u \in \mathrm{Ran}(S^\top)$.

We also use $\|\cdot\|_{\mathcal{X}}$ (resp. $\|\cdot\|_{\mathcal{Y}}$) and $\|\cdot\|_{\mathcal{X},*}$ (resp. $\|\cdot\|_{\mathcal{Y},*}$) for the (semi-)norm and the
corresponding dual norm in the primal space $\mathcal{X}$ (resp. the dual space $\mathcal{Y}$). Given a proper, closed,
and convex function $f$, we use $\mathrm{dom}(f)$ and $\partial f(x)$ to denote its domain and its subdifferential at
$x$. If $f$ is differentiable, then we use $\nabla f(x)$ for its gradient at $x$. For a given set $\mathcal{C}$, $\delta_{\mathcal{C}}(x) := 0$ if
$x \in \mathcal{C}$ and $\delta_{\mathcal{C}}(x) := +\infty$ otherwise, denotes the indicator function of $\mathcal{C}$.

For a smooth function $f : \mathcal{Z} \to \mathbb{R}$, we say that $f$ has an $L_f$-Lipschitz gradient with respect
to the semi-norm $\|\cdot\|_{\mathcal{Z}}$ if for any $z, \tilde{z} \in \mathrm{dom}(f)$, we have $\|\nabla f(z) - \nabla f(\tilde{z})\|_{\mathcal{Z},*} \le L_f \|z - \tilde{z}\|_{\mathcal{Z}}$,
where $L(f) := L_f \in [0, \infty)$. We denote by $\mathcal{F}_L^{1,1}$ the class of all convex functions $f$ with
$L_f$-Lipschitz gradient. We also use $\mu_f := \mu(f)$ for the strong convexity parameter of a convex
function $f$ w.r.t. the semi-norm $\|\cdot\|_{\mathcal{Z}}$, i.e., $f(\cdot) - (\mu_f/2)\|\cdot\|_{\mathcal{Z}}^2$ is convex. For a proper, closed and
convex function $f$, we use $\mathrm{prox}_f$ to denote its proximal operator, which is defined as
$\mathrm{prox}_f(z) := \arg\min_u \{ f(u) + (1/2)\|u - z\|_{\mathcal{Z}}^2 \}$.
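As a concrete instance of the proximal operator, the following minimal sketch evaluates $\mathrm{prox}_f$ for $f = \lambda\|\cdot\|_1$ in the Euclidean norm, which is the well-known soft-thresholding map. This example is not from the paper; the name `prox_l1` and the parameter `lam` are illustrative assumptions.

```python
def prox_l1(z, lam):
    """Proximal operator of f(u) = lam * ||u||_1 w.r.t. the Euclidean norm.

    Solves argmin_u { lam * ||u||_1 + 0.5 * ||u - z||_2^2 } coordinate-wise,
    which yields the soft-thresholding map.
    """
    out = []
    for zi in z:
        if zi > lam:
            out.append(zi - lam)   # shrink positive entries toward zero
        elif zi < -lam:
            out.append(zi + lam)   # shrink negative entries toward zero
        else:
            out.append(0.0)        # entries within [-lam, lam] are set to zero
    return out
```

Because the $\ell_1$-norm is separable, the operator acts on each coordinate independently, which is the same mechanism that permits parallel evaluation for decomposable $f$.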


2.2. Smooth proximity functions and Bregman distance. We use the following two
mathematical tools in the sequel.

2.2.1. Proximity functions. Given a nonempty, closed and convex set $\mathcal{Z}$, a continuous
and $\mu_p$-strongly convex function $p$ w.r.t. the semi-norm $\|\cdot\|_{\mathcal{Z}}$ is called a proximity function (or
prox-function) of $\mathcal{Z}$ if $\mathcal{Z} \subseteq \mathrm{dom}(p)$. We also denote

$$\bar{z}^c := \arg\min\{ p(z) : z \in \mathrm{dom}(p) \} \quad \text{and} \quad D_{\mathcal{Z}} := \sup\{ p(z) : z \in \mathcal{Z} \}, \tag{2.1}$$

as the prox-center of $p$ and the prox-diameter of $\mathcal{Z}$, respectively. Without loss of generality, we
can assume that $\mu_p = 1$ and $p(\bar{z}^c) = 0$. Otherwise, we can shift and rescale $p$. Moreover, $D_{\mathcal{Z}} \ge 0$,
and it is finite if $\mathcal{Z}$ is bounded.

In addition to the strong convexity, we also limit our class of prox-functions to the smooth
ones, which have a Lipschitz gradient with the Lipschitz constant $L_p \ge 1$. We denote this class
of prox-functions by $\mathcal{S}_L^{1,1}$. For example, $p_{\mathcal{Z}}(z) := (1/2)\|z\|_S^2$ is a simple prox-function, i.e.,
$(1/2)\|\cdot\|_S^2 \in \mathcal{S}_1^{1,1}$.

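2.2.2. Bregman distance. Instead of smoothing the primal-dual problems (1.1)-(1.2) by
smooth proximity functions, we use a Bregman distance defined via $p_{\mathcal{Z}}$ as

$$b_{\mathcal{Z}}(z, \dot{z}) := p_{\mathcal{Z}}(z) - p_{\mathcal{Z}}(\dot{z}) - \langle \nabla p_{\mathcal{Z}}(\dot{z}), z - \dot{z}\rangle, \quad \forall z, \dot{z} \in \mathbb{R}^n, \tag{2.2}$$

where $p_{\mathcal{Z}} \in \mathcal{S}_L^{1,1}$. Clearly, if we fix $\dot{z} = \bar{z}^c$ at the center point of $p_{\mathcal{Z}}$, then $b_{\mathcal{Z}}(z, \bar{z}^c) = p_{\mathcal{Z}}(z)$. In
addition, $\nabla_1 b_{\mathcal{Z}}(\dot{z}, \dot{z}) = 0$ for all $\dot{z} \in \mathcal{Z}$. We use in the sequel $\nabla b_{\mathcal{Z}}$ for $\nabla_1 b_{\mathcal{Z}}$.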
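The Bregman distance of (2.2) can be checked numerically on the quadratic prox-function $p(z) = \frac{1}{2}\|z\|_2^2$, for which it reduces to $\frac{1}{2}\|z - \dot{z}\|_2^2$. The sketch below is illustrative only; the helper names `bregman`, `p_quad` and `grad_p_quad` are assumptions, not notation from the paper.

```python
def bregman(p, grad_p, z, z_dot):
    """b(z, z_dot) = p(z) - p(z_dot) - <grad p(z_dot), z - z_dot>, as in (2.2)."""
    g = grad_p(z_dot)
    inner = sum(gi * (zi - di) for gi, zi, di in zip(g, z, z_dot))
    return p(z) - p(z_dot) - inner


def p_quad(z):
    """Quadratic prox-function p(z) = 0.5 * ||z||_2^2 (prox-center at the origin)."""
    return 0.5 * sum(zi * zi for zi in z)


def grad_p_quad(z):
    """Gradient of p_quad, which is the identity map."""
    return list(z)
```

Evaluating `bregman(p_quad, grad_p_quad, z, [0.0, 0.0])` returns `p_quad(z)`, which matches the statement that the Bregman distance to the prox-center coincides with the prox-function itself.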
2.3. Basic assumption. Our main assumption for problems (1.1)-(1.2) is to guarantee
strong duality, which essentially requires the following assumption (see [1, Proposition 15.22]).

Assumption A.1. The solution set $\mathcal{X}^*$ of the primal problem (1.1) (or (1.3)) is nonempty.
In addition, the following assumption holds for either (1.1) or (1.3):

(a) The condition $0 \in \mathrm{ri}\,(\mathrm{dom}(g) - A(\mathrm{dom}(f)))$ for (1.1) holds.

(b) The Slater condition $\mathrm{ri}\,(\mathrm{dom}(f)) \cap \{x : Ax = c\} \ne \emptyset$ for (1.3) holds.

Under Assumption A.1, the strong duality for (1.1)-(1.2) holds, see, e.g., [1]. The solution set
$\mathcal{Y}^*$ of the dual problem (1.2) is nonempty, and

$$P^* = f(x^*) + g(Ax^*) = D^* = -f^*(-A^\top y^*) - g^*(y^*), \quad \forall x^* \in \mathcal{X}^*, \ \forall y^* \in \mathcal{Y}^*. \tag{2.3}$$


Let $\mathcal{W}^* := \mathcal{X}^* \times \mathcal{Y}^*$ be the primal-dual (or the saddle point) set of (1.1)-(1.2). Then, (2.3) is
equivalent to $f(x^*) + g(Ax^*) + f^*(-A^\top y^*) + g^*(y^*) = 0$ for all $(x^*, y^*) \in \mathcal{X}^* \times \mathcal{Y}^*$. In addition,
we can write the optimality condition of (1.1)-(1.2) as follows:

$$-A^\top y^* \in \partial f(x^*) \quad \text{and} \quad y^* \in \partial g(Ax^*). \tag{2.4}$$

Note that this condition can be written as $0 \in \partial f(x^*) + A^\top \partial g(Ax^*)$ for the primal problem (1.1),
and $0 \in \partial g^*(y^*) - A\,\partial f^*(-A^\top y^*)$ for the dual problem (1.2).

2.4. Smoothed primal-dual gap function. The gap function G defined in (1.4) is convex 
but generally nonsmooth. This subsection introduces a smoothed primal-dual gap function that 
approximates G using smooth prox-functions. 

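2.4.1. The first smoothed approximation. Let $b_{\mathcal{X}}$ be a Bregman distance defined on $\mathcal{X}$,
and let $\dot{x} \in \mathbb{R}^n$ be given. We consider an approximation to $D(\cdot)$ as

$$D_\gamma(y; \dot{x}) := \min_{x \in \mathbb{R}^n} \{ f(x) + \langle y, Ax\rangle + \gamma b_{\mathcal{X}}(x, \dot{x}) \} - g^*(y) = -f_\gamma^*(-A^\top y; \dot{x}) - g^*(y), \tag{2.5}$$

where $\gamma > 0$ is a dual smoothness parameter and $f_\gamma^*(\cdot\,; \dot{x})$ denotes the Fenchel conjugate of
$f + \gamma b_{\mathcal{X}}(\cdot, \dot{x})$. For $\gamma = 0$, (2.5) reduces to the dual function $D$ in (1.2). The minimization
subproblem in (2.5) always admits a solution, which is denoted by

$$x^*_\gamma(y; \dot{x}) \in \arg\min_{x \in \mathbb{R}^n} \{ f(x) + \langle y, Ax\rangle + \gamma b_{\mathcal{X}}(x, \dot{x}) \}. \tag{2.6}$$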

We emphasize that our algorithms presented in the next sections support parallel and
distributed computation for the decomposable setting of (1.1) or (1.3), where $f$ is decomposed into
$N$ terms as $f(x) := \sum_{i=1}^N f_i(x_i)$ with the $i$-th block $x_i$ being in $\mathbb{R}^{n_i}$ such that $\sum_{i=1}^N n_i = n$. In
this case, we can choose a separable prox-function to generate a decomposable Bregman distance
$b_{\mathcal{X}}(x, \dot{x}) := \sum_{i=1}^N b_i(x_i, \dot{x}_i)$ to approximate the dual function $D$ defined in (1.2). By exploiting this
decomposable structure, we can evaluate the smoothed dual function and its gradient in a parallel
or distributed fashion. We will discuss the details of this setting in the sequel, see Section 5.

2.4.2. The second smoothed approximation. Let $b_{\mathcal{Y}}$ be a Bregman distance defined on
$\mathcal{Y}$, the feasible set of the dual problem (1.2), and let $\dot{y} \in \mathbb{R}^m$ be given. We consider an
approximation to the objective $g(\cdot)$ in (1.1) as

$$g_\beta(u; \dot{y}) := \max_{y \in \mathbb{R}^m} \{ \langle u, y\rangle - g^*(y) - \beta b_{\mathcal{Y}}(y, \dot{y}) \}, \tag{2.7}$$

where $\beta > 0$ is a primal smoothness parameter. We also denote the solution of the maximization
problem in (2.7) by $y^*_\beta(u; \dot{y})$, i.e.:

$$y^*_\beta(u; \dot{y}) := \arg\max_{y \in \mathbb{R}^m} \{ \langle u, y\rangle - g^*(y) - \beta b_{\mathcal{Y}}(y, \dot{y}) \}. \tag{2.8}$$

We consider an approximation to the primal objective $P$ as

$$P_\beta(x; \dot{y}) := f(x) + g_\beta(Ax; \dot{y}). \tag{2.9}$$

This function is the second smoothed approximation for the primal problem. We note that if
$g(\cdot) := \delta_{\{c\}}(\cdot)$ and $b_{\mathcal{Y}}(\cdot, \dot{y}) := (1/2)\|\cdot - \dot{y}\|_2^2$, then $y^*_\beta(u; \dot{y}) = \dot{y} + \beta^{-1}(u - c)$, which has a closed form.
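The closed form of the dual step for the equality-constrained case can be verified directly: with $g = \delta_{\{c\}}$ we have $g^*(y) = \langle c, y\rangle$, so the objective of (2.7) is a concave quadratic in $y$ whose stationarity condition $u - c - \beta(y - \dot{y}) = 0$ gives $y^*_\beta(u; \dot{y}) = \dot{y} + \beta^{-1}(u - c)$. A minimal sketch, assuming the Euclidean Bregman distance and using illustrative function names:

```python
def dual_step_equality(u, c, y_dot, beta):
    """Closed-form maximizer y_dot + (u - c) / beta of (2.7) for g = indicator of {c}."""
    return [yd + (ui - ci) / beta for ui, ci, yd in zip(u, c, y_dot)]


def smoothed_objective(y, u, c, y_dot, beta):
    """Objective of (2.7) in this case: <u - c, y> - (beta / 2) * ||y - y_dot||_2^2."""
    lin = sum((ui - ci) * yi for ui, ci, yi in zip(u, c, y))
    quad = 0.5 * beta * sum((yi - di) ** 2 for yi, di in zip(y, y_dot))
    return lin - quad
```

Perturbing the returned point in any direction should not increase `smoothed_objective`, which gives a quick sanity check of the formula.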
2.5. Smoothed gap function and its properties. Given $D_\gamma$ and $P_\beta$ defined by (2.5) and
(2.9), respectively, and the primal-dual variable $w = (x, y)$, the smoothed primal-dual gap (or the
smoothed gap) $G_{\gamma\beta}$ is now defined as

$$G_{\gamma\beta}(w; \dot{w}) := P_\beta(x; \dot{y}) - D_\gamma(y; \dot{x}), \tag{2.10}$$

where $\gamma$ and $\beta$ are two smoothness parameters, and $\dot{w} := (\dot{x}, \dot{y})$.

The following lemma provides fundamental bounds of the objective residual $P(x) - P^*$ for the
unconstrained form (1.1), and the objective residual $f(x) - f^*$ and the feasibility gap $\|Ax - c\|$
for the constrained form (1.3). For clarity of exposition, we move its proof to Appendix A.2.


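Lemma 2.1. Let $G_{\gamma\beta}$ be the smoothed gap function defined by (2.10) and $S_\beta(x) := P_\beta(x; \dot{y}) -
P^* = f(x) + g_\beta(Ax; \dot{y}) - P^*$ be the smoothed objective residual. Then, we have

$$S_\beta(x) \le G_{\gamma\beta}(w; \dot{w}) + \gamma b_{\mathcal{X}}(x^*, \dot{x}) \quad \text{and} \quad \frac{1}{2}\|y^*_\beta(Ax; \dot{y}) - y^*\|_{\mathcal{Y}}^2 \le b_{\mathcal{Y}}(y^*, \dot{y}) + \frac{1}{\beta} S_\beta(x). \tag{2.11}$$

Suppose that $g(\cdot) := \delta_{\{c\}}(\cdot)$. Then, for any $y^* \in \mathcal{Y}^*$ and $x \in \mathcal{X}$, one has

$$-\|y^*\|_{\mathcal{Y}} \|Ax - c\|_{\mathcal{Y},*} \le f(x) - f^* \le f(x) - D(y).$$

The following primal objective residual and feasibility gap estimates hold for (1.3):

$$f(x) - f^* \le S_\beta(x) - \langle y^*, Ax - c\rangle + \beta b_{\mathcal{Y}}(y^*; \dot{y}), \tag{2.12}$$

$$\|Ax - c\|_{\mathcal{Y},*} \le \beta L_{b_{\mathcal{Y}}} \Big[ \|y^* - \dot{y}\|_{\mathcal{Y}} + \big( \|y^* - \dot{y}\|_{\mathcal{Y}}^2 + 2 L_{b_{\mathcal{Y}}}^{-1} \beta^{-1} S_\beta(x) \big)^{1/2} \Big], \tag{2.13}$$

where the quantity under the square root is always nonnegative.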

The estimates (2.12) and (2.13) are independent of the optimization methods used to construct
$\{\bar{w}^k\}$ for the primal-dual variable $w = (x, y)$. However, their convergence guarantee depends
on the smoothness parameters $\gamma_k$ and $\beta_k$. Hence, the convergence rate of the objective residual
$f(\bar{x}^k) - f^*$ and feasibility gap $\|A\bar{x}^k - c\|$ depends on the rate of $\{(\gamma_k, \beta_k)\}$.

The second inequality in (2.11) shows that the distance between $y^*_\beta(Ax; \dot{y})$ and $y^*$ is controlled
by quantities that will remain bounded. In practice, we observed that $y^*_\beta(Ax; \dot{y})$ seems to
converge to $y^*$. Hence, restarting the algorithm with $\dot{y}' := y^*_\beta(Ax; \dot{y})$, as we will do in the
experiments, will not hurt convergence too much, even in unfavorable cases.

3. The accelerated primal-dual gap reduction algorithm. Our new scheme builds
upon Nesterov’s acceleration idea. At each iteration, we apply an accelerated proximal-gradient
step to minimize $f + g_\beta$. Since $f + g_\beta$ is nonsmooth, we use the proximal operator of $f$
to generate a proximal-gradient step. As a key feature, we must update the parameters $\tau_k$ and
$\beta_k$ simultaneously at each iteration with analytical updating formulas.

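3.1. The method. Let $\bar{x}^k \in \mathcal{X}$ and $\tilde{x}^k \in \mathcal{X}$ be given. The Accelerated Smoothed GAp
ReDuction (ASGARD) scheme generates a new point $(\bar{x}^{k+1}, \tilde{x}^{k+1})$ as

$$\begin{cases}
\hat{x}^k := (1 - \tau_k)\bar{x}^k + \tau_k \tilde{x}^k, \\
y^*_{\beta_{k+1}}(A\hat{x}^k; \dot{y}) := \arg\max_{y \in \mathbb{R}^m} \{ \langle A\hat{x}^k, y\rangle - g^*(y) - \beta_{k+1} b_{\mathcal{Y}}(y; \dot{y}) \}, \\
\bar{x}^{k+1} := \mathrm{prox}_{\beta_{k+1} L_A^{-1} f}\big( \hat{x}^k - \beta_{k+1} L_A^{-1} A^\top y^*_{\beta_{k+1}}(A\hat{x}^k; \dot{y}) \big), \\
\tilde{x}^{k+1} := \tilde{x}^k - \tau_k^{-1}(\hat{x}^k - \bar{x}^{k+1}),
\end{cases} \tag{ASGARD}$$

where $\tau_k \in (0, 1]$ and $\beta_k > 0$ are parameters that will be defined in the sequel. The constant $L_A$
is defined as $L_A := \|A\|^2 = \max_{x \in \mathcal{X}} \big\{ \|Ax\|_{\mathcal{Y},*}^2 / \|x\|_{\mathcal{X}}^2 \big\}$.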
Y 
The ASGARD scheme requires a mirror step in $g^*$ to get $y^*_{\beta_{k+1}}(\cdot\,; \dot{y})$ in (2.8) and a proximal
step of $f$. Computing this proximal step can be implemented in parallel when $f$ has the
decomposable structure discussed above.

The following lemma shows that $\bar{x}^{k+1}$ updated by (ASGARD) decreases the smoothed
objective residual $P_{\beta_k}(\bar{x}^k) - P^*$; its proof can be found in Appendix A.3.1.


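Lemma 3.1. Let us choose $\tau_0 := 1$. If $\tau_k \in (0, 1)$ is the unique positive root of the cubic
polynomial equation $p_3(\tau) := \tau^3/L_{b_{\mathcal{Y}}} + \tau^2 + \tau_{k-1}^2 \tau - \tau_{k-1}^2 = 0$ for $k \ge 1$, and
$\beta_{k+1} := \dfrac{\beta_k}{1 + \tau_k/L_{b_{\mathcal{Y}}}}$, then $\beta_k \to 0$ as $k \to \infty$, and

$$P_{\beta_{k+1}}(\bar{x}^{k+1}) - P^* + \frac{\tau_k^2 L_A}{2\beta_{k+1}} \|\tilde{x}^{k+1} - x^*\|_{\mathcal{X}}^2 \le \frac{\tau_k^2 L_A}{2\beta_{k+1}} \|\tilde{x}^0 - x^*\|_{\mathcal{X}}^2. \tag{3.1}$$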

Moreover, $\frac{1}{k+1} \le \tau_k \le \frac{2}{k+2}$, and if $L_{b_{\mathcal{Y}}} = 1$, then $\beta_k \le \frac{2\beta_1}{k+1}$.
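The parameter sequences of Lemma 3.1 are cheap to generate. The following sketch computes the root of the cubic equation by bisection (our choice for illustration; any scalar root-finder or a closed-form cubic formula would do) and then applies the update $\beta_{k+1} = \beta_k/(1 + \tau_k/L_{b_{\mathcal{Y}}})$. The names `cubic_root` and `asgard_parameters` are hypothetical.

```python
def cubic_root(tau_prev, L_b=1.0, tol=1e-14):
    """Root in (0, 1) of t^3/L_b + t^2 + tau_prev^2 * t - tau_prev^2 = 0, by bisection."""
    def p(t):
        return t ** 3 / L_b + t ** 2 + tau_prev ** 2 * t - tau_prev ** 2

    lo, hi = 0.0, 1.0  # p(0) < 0 < p(1) and p is increasing on [0, 1]
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if p(mid) > 0.0:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)


def asgard_parameters(beta1, num_iters, L_b=1.0):
    """Return taus = [tau_0, ..., tau_K] and betas = [beta_1, ..., beta_{K+1}]."""
    taus = [1.0]        # tau_0 := 1
    betas = [beta1]     # beta_1 is chosen by the user
    for _ in range(num_iters):
        taus.append(cubic_root(taus[-1], L_b))
        betas.append(betas[-1] / (1.0 + taus[-1] / L_b))
    return taus, betas
```

With `taus, betas = asgard_parameters(1.0, 50)`, the entry `taus[k]` holds $\tau_k$ and `betas[k - 1]` holds $\beta_k$, so the sequences can be compared against the bounds $\frac{1}{k+1} \le \tau_k \le \frac{2}{k+2}$ and $\beta_k \le \frac{2\beta_1}{k+1}$.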

3.2. The primal-dual algorithmic template. Similarly to standard accelerated schemes,
we can eliminate $\tilde{x}^k$ in (ASGARD) by combining its first and last lines to obtain

$$\hat{x}^{k+1} = \bar{x}^{k+1} + \frac{(1 - \tau_k)\tau_{k+1}}{\tau_k} (\bar{x}^{k+1} - \bar{x}^k).$$

Now, we combine all the ingredients presented previously and this step to obtain a primal-dual
algorithmic template for solving (1.1) as in Algorithm 1.


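Algorithm 1 (Accelerated Smoothed GAp ReDuction (ASGARD) algorithm)

Initialization:

1: Choose $\beta_1 > 0$ (e.g., $\beta_1 := 0.5\sqrt{L_A}$) and set $\tau_0 := 1$.

2: Choose $\bar{x}^0 \in \mathcal{X}$ arbitrarily, and set $\hat{x}^0 := \bar{x}^0$.

For $k = 0$ to $k_{\max}$, perform:

3: Compute $\tau_{k+1} \in (0, 1)$, the unique positive root of $\tau^3/L_{b_{\mathcal{Y}}} + \tau^2 + \tau_k^2 \tau - \tau_k^2 = 0$.

4: Compute the dual step by solving
   $y^*_{\beta_{k+1}}(A\hat{x}^k; \dot{y}) := \arg\max_{y \in \mathbb{R}^m} \{ \langle A\hat{x}^k, y\rangle - g^*(y) - \beta_{k+1} b_{\mathcal{Y}}(y, \dot{y}) \}$.

5: Compute the primal step $\bar{x}^{k+1}$ using the prox of $f$ as
   $\bar{x}^{k+1} := \mathrm{prox}_{\beta_{k+1} L_A^{-1} f}\big( \hat{x}^k - \beta_{k+1} L_A^{-1} A^\top y^*_{\beta_{k+1}}(A\hat{x}^k; \dot{y}) \big)$.

6: Update $\hat{x}^{k+1} := \bar{x}^{k+1} + \dfrac{\tau_{k+1}(1 - \tau_k)}{\tau_k} (\bar{x}^{k+1} - \bar{x}^k)$ and $\beta_{k+2} := \dfrac{\beta_{k+1}}{1 + L_{b_{\mathcal{Y}}}^{-1} \tau_{k+1}}$.

End for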


The computationally heavy steps of Algorithm 1 are given by Steps 4 and 5. At Step 4,
$y^*_{\beta_{k+1}}(A\hat{x}^k; \dot{y})$ needs a matrix-vector multiplication $Ax$ and one $\mathrm{prox}_{g^*}$. If $g(\cdot) := \delta_{\{c\}}(\cdot)$ and
$b_{\mathcal{Y}}(\cdot, \dot{y}) := (1/2)\|\cdot - \dot{y}\|_2^2$, then we can compute $y^*_{\beta_{k+1}}(A\hat{x}^k; \dot{y}) = \dot{y} + \beta_{k+1}^{-1}(A\hat{x}^k - c)$. At Step 5, the
algorithm requires one proximal step on $f$, which can be implemented in parallel when $f$ has a
decomposable structure. For this step, we need one adjoint matrix-vector multiplication $A^\top y$.
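The steps of Algorithm 1 can be sketched on a small equality-constrained instance. The example below is an illustration under stated assumptions, not the authors' implementation: it takes $f(x) = \frac{1}{2}\|x - a\|_2^2$ (so that $\mathrm{prox}_{tf}(z) = (z + ta)/(1 + t)$), $g = \delta_{\{c\}}$, Euclidean Bregman distances with $\dot{y} = 0$ and $L_{b_{\mathcal{Y}}} = 1$, a starting point $\bar{x}^0 = 0$, and a bisection root-finder for the cubic equation. All function names are hypothetical.

```python
def cubic_root(tau_prev, tol=1e-14):
    """Root in (0, 1) of t^3 + t^2 + tau_prev^2 * t - tau_prev^2 = 0 (case L_b = 1)."""
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        val = mid ** 3 + mid ** 2 + tau_prev ** 2 * mid - tau_prev ** 2
        if val > 0.0:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)


def matvec(A, x):
    """Compute A x for a matrix stored as a list of rows."""
    return [sum(aij * xj for aij, xj in zip(row, x)) for row in A]


def rmatvec(A, y):
    """Compute A^T y for a matrix stored as a list of rows."""
    return [sum(A[i][j] * y[i] for i in range(len(A))) for j in range(len(A[0]))]


def asgard_equality(A, c, a, L_A, beta1, num_iters):
    """ASGARD sketch for min 0.5*||x - a||^2 s.t. Ax = c, with y_dot = 0 and x_bar^0 = 0."""
    n = len(A[0])
    x_bar = [0.0] * n
    x_hat = list(x_bar)
    tau, beta = 1.0, beta1              # tau_0 := 1 and beta_1
    for _ in range(num_iters):
        tau_next = cubic_root(tau)      # Step 3: tau_{k+1}
        # Step 4: closed-form dual step y = y_dot + (A x_hat - c) / beta_{k+1}
        r = matvec(A, x_hat)
        y = [(ri - ci) / beta for ri, ci in zip(r, c)]
        # Step 5: proximal-gradient step with step size t = beta_{k+1} / L_A
        t = beta / L_A
        g = rmatvec(A, y)
        z = [xh - t * gj for xh, gj in zip(x_hat, g)]
        x_new = [(zj + t * aj) / (1.0 + t) for zj, aj in zip(z, a)]
        # Step 6: momentum update of x_hat and homotopy update of beta
        m = tau_next * (1.0 - tau) / tau
        x_hat = [xn + m * (xn - xb) for xn, xb in zip(x_new, x_bar)]
        x_bar = x_new
        beta = beta / (1.0 + tau_next)
        tau = tau_next
    return x_bar
```

For example, `asgard_equality([[1.0, 2.0]], [2.0], [1.0, 0.0], 5.0, 1.118, 2000)` targets the projection of $a = (1, 0)$ onto $\{x : x_1 + 2x_2 = 2\}$, which is $x^* = (1.2, 0.4)$ with $f^* = 0.1$; here $L_A = \|A\|^2 = 5$ and $\beta_1 \approx 0.5\sqrt{L_A}$.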


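3.3. Convergence analysis. Our first main result is the following theorem, which shows
the $O(1/k)$ convergence rate of Algorithm 1 for both the unconstrained problem (1.1) and the
constrained setting (1.3).

Theorem 3.2. Let $\{\bar{x}^k\}$ be the primal sequence generated by Algorithm 1 for any $\beta_1 > 0$,
and $b_{\mathcal{Y}}$ be chosen such that $L_{b_{\mathcal{Y}}} = 1$.
If $\mathrm{dom}(g) = \mathcal{Y}$, then the primal objective residual of (1.1) satisfies

$$P(\bar{x}^k) - P^* \le \frac{L_A}{2\beta_1 k} \|\bar{x}^0 - x^*\|_{\mathcal{X}}^2 + \frac{2\beta_1}{k+1} b_{\mathcal{Y}}(y^*(\bar{x}^k), \dot{y}) \quad \text{for all } k \ge 1, \tag{3.2}$$

where $y^*(\bar{x}^k) \in \partial g(A\bar{x}^k)$.

If $g(\cdot) := \delta_{\{c\}}(\cdot)$, then the following bounds hold for problem (1.3):

$$\begin{cases}
f(\bar{x}^k) - f^* \ge -\|y^*\|_{\mathcal{Y}} \|A\bar{x}^k - c\|_{\mathcal{Y},*}, \\
f(\bar{x}^k) - f^* \le \dfrac{L_A}{2\beta_1 k} \|\bar{x}^0 - x^*\|_{\mathcal{X}}^2 + \|y^*\|_{\mathcal{Y}} \|A\bar{x}^k - c\|_{\mathcal{Y},*} + \dfrac{2\beta_1}{k+1} b_{\mathcal{Y}}(y^*, \dot{y}), \\
\|A\bar{x}^k - c\|_{\mathcal{Y},*} \le \dfrac{2\beta_1}{k+1} \Big[ \|y^* - \dot{y}\|_{\mathcal{Y}} + \big( \|y^* - \dot{y}\|_{\mathcal{Y}}^2 + \beta_1^{-2} L_A \|\bar{x}^0 - x^*\|_{\mathcal{X}}^2 \big)^{1/2} \Big].
\end{cases} \tag{3.3}$$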

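Proof. If $\mathrm{dom}(g) = \mathcal{Y}$, then $\partial g(Ax) \ne \emptyset$. Let $y^*(x) \in \partial g(Ax)$. We can show that
$P(x) \le P_\beta(x; \dot{y}) + \beta b_{\mathcal{Y}}(y^*(x), \dot{y})$. Combining this estimate with (3.1), the bound
$\beta_k \le \frac{2\beta_1}{k+1}$, and

$$\frac{\tau_k^2}{\beta_{k+1}} = \frac{(1 - \tau_k)\tau_{k-1}^2}{\beta_k} = \frac{1}{\beta_1} \prod_{i=1}^{k} (1 - \tau_i) \le \frac{1}{\beta_1 (k + 1)},$$

we obtain (3.2). Using the same estimates in Lemma 2.1, we obtain the bounds as in (3.3). □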

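From (3.2), if $\mathrm{dom}(g^*)$ is bounded, then $b_{\mathcal{Y}}(y^*(\bar{x}^k), \dot{y}) \le \sup_{y \in \mathrm{dom}(g^*)} b_{\mathcal{Y}}(y, \dot{y}) < +\infty$. As a
special case, if $g$ is Lipschitz continuous, then $\mathrm{dom}(g^*)$ is bounded [1]. For (3.3), note that if we
choose $\dot{y} := 0^m$, then the bounds (3.3) can be further simplified as

$$\begin{cases}
f(\bar{x}^k) - f^* \le \dfrac{1}{k} \Big( \dfrac{L_A}{2\beta_1} \|\bar{x}^0 - x^*\|_{\mathcal{X}}^2 + 5\beta_1 \|y^*\|_{\mathcal{Y}}^2 + 2\sqrt{L_A}\, \|\bar{x}^0 - x^*\|_{\mathcal{X}} \|y^*\|_{\mathcal{Y}} \Big), \\
\|A\bar{x}^k - c\|_{\mathcal{Y},*} \le \dfrac{2\beta_1}{k+1} \Big( 2\|y^*\|_{\mathcal{Y}} + \dfrac{\sqrt{L_A}}{\beta_1} \|\bar{x}^0 - x^*\|_{\mathcal{X}} \Big).
\end{cases} \tag{3.4}$$

Clearly, the choice of $\beta_1$ in Theorem 3.2 trades off between $\|\bar{x}^0 - x^*\|_{\mathcal{X}}^2$ and $\|y^* - \dot{y}\|_{\mathcal{Y}}$ on the
primal objective residual $f(\bar{x}^k) - f^*$ and on the feasibility gap $\|A\bar{x}^k - c\|_{\mathcal{Y},*}$.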

4. The accelerated dual smoothed gap reduction method. Algorithm 1 can be viewed
as an accelerated proximal scheme applied to minimize the function $P_\beta(\cdot\,; \dot{y})$ defined in (2.9).
Now, we exploit the smoothed gap function $G_{\gamma\beta}$ defined by (2.10) to develop a novel primal
scheme for solving (1.1). Our goal is to design a new scheme to compute a primal-dual sequence
$\{\bar{w}^k\}$ and a parameter sequence $\{(\gamma_k, \beta_k)\}$ such that $\max\{0, G_{\gamma_k\beta_k}(\bar{w}^k; \dot{w})\}$ converges to zero.

4.1. The method. Given $\bar{w}^k := (\bar{x}^k, \bar{y}^k) \in \mathcal{W}$, we derive a scheme to compute a new point
$\bar{w}^{k+1} := (\bar{x}^{k+1}, \bar{y}^{k+1})$ as follows:

$$\begin{cases}
\hat{y}^k := (1 - \tau_k)\bar{y}^k + \tau_k y^*_{\beta_k}(A\bar{x}^k; \dot{y}), \\
\bar{y}^{k+1} := \mathrm{prox}_{\gamma_{k+1} L_A^{-1} g^*}\big( \hat{y}^k + \gamma_{k+1} L_A^{-1} A x^*_{\gamma_{k+1}}(\hat{y}^k; \dot{x}) \big), \\
\bar{x}^{k+1} := (1 - \tau_k)\bar{x}^k + \tau_k x^*_{\gamma_{k+1}}(\hat{y}^k; \dot{x}),
\end{cases} \tag{ADSGARD}$$

where $\tau_k \in (0, 1)$ and the parameters $\beta_k > 0$ and $\gamma_{k+1} > 0$ will be updated in the sequel. The
points $x^*_{\gamma_{k+1}}(\hat{y}^k; \dot{x})$ and $y^*_{\beta_k}(A\bar{x}^k; \dot{y})$ are computed by (2.6) and (2.8), respectively. This scheme
requires one primal step for $x^*_{\gamma_{k+1}}(\hat{y}^k; \dot{x})$, one dual step for $y^*_{\beta_k}(A\bar{x}^k; \dot{y})$, and one dual
proximal-gradient step for $\bar{y}^{k+1}$. Since the accelerated step is applied to the smoothed dual function
$D_\gamma$, we call this scheme the Accelerated Dual Smoothed GAp ReDuction (ADSGARD) scheme.

The following lemma, whose proof is in Appendix A.4, shows that $\bar{w}^{k+1}$ updated by ADSGARD
decreases the smoothed gap $G_{\gamma_k\beta_k}(\bar{w}^k)$ by at least a factor of $(1 - \tau_k)$.

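Lemma 4.1. Let $\bar{w}^{k+1} := (\bar{x}^{k+1}, \bar{y}^{k+1})$ be updated by the ADSGARD scheme. Then, if
$\tau_k \in (0, 1]$, $\beta_k$ and $\gamma_k$ are chosen such that $\beta_1\gamma_1 \ge L_A$ and

$$\Big(1 + \frac{\tau_k}{L_{b_{\mathcal{X}}}}\Big)\gamma_{k+1} \ge \gamma_k, \quad \beta_{k+1} \ge (1 - \tau_k)\beta_k, \quad \text{and} \quad \frac{L_A}{\gamma_{k+1}} \le \frac{(1 - \tau_k)\beta_k}{\tau_k^2}, \tag{4.1}$$

then $\bar{w}^{k+1} \in \mathcal{W}$ and satisfies $G_{\gamma_{k+1}\beta_{k+1}}(\bar{w}^{k+1}; \dot{w}) \le (1 - \tau_k)\, G_{\gamma_k\beta_k}(\bar{w}^k; \dot{w}) \le 0$.

Let $\tau_0 := 1$. Then, for all $k \ge 1$, if we choose $\tau_k \in (0, 1)$ to be the unique positive solution
of the cubic equation $p_3(\tau) := \tau^3/L_{b_{\mathcal{X}}} + \tau^2 + \tau_{k-1}^2 \tau - \tau_{k-1}^2 = 0$, then $\frac{1}{k+1} \le \tau_k \le \frac{2}{k+2}$ for $k \ge 1$.
The parameters $\beta_k$ and $\gamma_k$ computed by $\beta_1\gamma_1 = L_A$ and

$$\gamma_{k+1} := \frac{\gamma_k}{1 + \tau_k/L_{b_{\mathcal{X}}}} \quad \text{and} \quad \beta_{k+1} := (1 - \tau_k)\beta_k, \tag{4.2}$$

satisfy the conditions in (4.1).

In addition, if $L_{b_{\mathcal{X}}} = 1$, then $\gamma_k \le \frac{2\gamma_1}{k+1}$ and $\frac{\beta_1}{2(k+1)} \le \beta_{k+1} \le \frac{\beta_1}{k+1}$ for $k \ge 1$.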

4.2. The primal-dual algorithmic template. We combine all the ingredients presented
in the previous subsections to obtain a primal-dual algorithmic template for solving (1.1) as shown
in Algorithm 2.

Algorithm 2 (Accelerated Dual Smoothed GAp ReDuction (ADSGARD))

Initialization:

1: Choose $\gamma_1 > 0$ (e.g., $\gamma_1 := \sqrt{L_A}$). Set $\beta_1 := L_A/\gamma_1$ and $\tau_0 := 1$.

2: Take an initial point $y^*_0 := \dot{y} \in \mathcal{Y}$.

For $k = 0$ to $k_{\max}$, perform:

3: Update $\hat{y}^k := (1 - \tau_k)\bar{y}^k + \tau_k y^*_k$.

4: Compute $x^*_{k+1}$ in parallel with
   $x^*_{k+1} := \arg\min_{x \in \mathbb{R}^n} \{ f(x) + \langle A^\top \hat{y}^k, x\rangle + \gamma_{k+1} b_{\mathcal{X}}(x, \dot{x}) \}$.

5: Update the dual vector
   $\bar{y}^{k+1} := \mathrm{prox}_{\gamma_{k+1} L_A^{-1} g^*}\big( \hat{y}^k + \gamma_{k+1} L_A^{-1} A x^*_{k+1} \big)$.

6: Update the primal vector: $\bar{x}^{k+1} := (1 - \tau_k)\bar{x}^k + \tau_k x^*_{k+1}$.

7: Compute
   $y^*_{k+1} := \arg\max_{y \in \mathbb{R}^m} \{ \langle A\bar{x}^{k+1}, y\rangle - g^*(y) - \beta_{k+1} b_{\mathcal{Y}}(y, \dot{y}) \}$.

8: Compute $\tau_{k+1} \in (0, 1)$, the unique positive root of $\tau^3/L_{b_{\mathcal{X}}} + \tau^2 + \tau_k^2 \tau - \tau_k^2 = 0$.

9: Update $\gamma_{k+2} := \dfrac{\gamma_{k+1}}{1 + \tau_{k+1}/L_{b_{\mathcal{X}}}}$ and $\beta_{k+2} := (1 - \tau_{k+1})\beta_{k+1}$.

End for


We note that since $\tau_0 = 1$, Step 3 shows that $\hat{y}^0 = y^*_0$, while Step 6 leads to $\bar{x}^1 = x^*_1$. The
main steps of Algorithm 2 are Steps 4, 5 and 7, where we need to solve the subproblem (2.6), and
to update two dual steps, respectively. The first dual step requires the proximal operator
of $g^*$, while the second dual step computes $y^*_{k+1} = y^*_{\beta_{k+1}}(A\bar{x}^{k+1}; \dot{y})$.
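When $g(\cdot) := \delta_{\{c\}}(\cdot)$, the indicator of $\{c\}$ in the constrained problem (1.3), we have

$$y^*_{\beta_k}(A\bar{x}^k; \dot{y}) = \nabla b^*_{\mathcal{Y}}\big( \beta_k^{-1}(A\bar{x}^k - c), \dot{y} \big) \quad \text{and} \quad \bar{y}^{k+1} := \hat{y}^k + \frac{\gamma_{k+1}}{L_A} \big( A x^*_{\gamma_{k+1}}(\hat{y}^k; \dot{x}) - c \big).$$

The first dual step only requires one matrix-vector multiplication $Ax$. Clearly, by Step 6, it
follows that $A\bar{x}^{k+1} - c = (1 - \tau_k)(A\bar{x}^k - c) + \tau_k(A x^*_{k+1} - c)$, and by Step 7, we have $y^*_k =
y^*_{\beta_k}(A\bar{x}^k; \dot{y}) = \nabla b^*_{\mathcal{Y}}(\beta_k^{-1}(A\bar{x}^k - c), \dot{y})$, which is equivalent to $A\bar{x}^k - c = \beta_k \nabla b_{\mathcal{Y}}(y^*_k, \dot{y})$. Hence,
$A\bar{x}^{k+1} - c = (1 - \tau_k)\beta_k \nabla b_{\mathcal{Y}}(y^*_k, \dot{y}) + \frac{\tau_k L_A}{\gamma_{k+1}} (\bar{y}^{k+1} - \hat{y}^k)$ due to Step 5. Finally, we can derive an
update rule for $y^*_k$ as

$$y^*_{k+1} = \nabla b^*_{\mathcal{Y}}\Big( \frac{1}{\beta_{k+1}} \Big( (1 - \tau_k)\beta_k \nabla b_{\mathcal{Y}}(y^*_k, \dot{y}) + \frac{\tau_k L_A}{\gamma_{k+1}} (\bar{y}^{k+1} - \hat{y}^k) \Big), \dot{y} \Big). \tag{4.3}$$

Consequently, each iteration of Algorithm 2 requires one solution of the primal step at Step 4,
one matrix-vector multiplication $Ax$ and the adjoint $A^\top y$.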
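The constrained specialization of Algorithm 2 can be sketched in the same way as Algorithm 1. The code below is a minimal illustration under stated assumptions, not the authors' implementation: $f(x) = \frac{1}{2}\|x - a\|_2^2$, $g = \delta_{\{c\}}$, Euclidean Bregman distances centered at $\dot{x} = 0$ and $\dot{y} = 0$ (so $L_{b_{\mathcal{X}}} = 1$), for which Step 4 has the closed form $x^*_{k+1} = (a - A^\top \hat{y}^k)/(1 + \gamma_{k+1})$. The cubic equation is solved by bisection, and all function names are hypothetical.

```python
def cubic_root(tau_prev, tol=1e-14):
    """Root in (0, 1) of t^3 + t^2 + tau_prev^2 * t - tau_prev^2 = 0 (case L_b = 1)."""
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        val = mid ** 3 + mid ** 2 + tau_prev ** 2 * mid - tau_prev ** 2
        if val > 0.0:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)


def matvec(A, x):
    """Compute A x for a matrix stored as a list of rows."""
    return [sum(aij * xj for aij, xj in zip(row, x)) for row in A]


def rmatvec(A, y):
    """Compute A^T y for a matrix stored as a list of rows."""
    return [sum(A[i][j] * y[i] for i in range(len(A))) for j in range(len(A[0]))]


def adsgard_equality(A, c, a, L_A, gamma1, num_iters):
    """ADSGARD sketch for min 0.5*||x - a||^2 s.t. Ax = c, with x_dot = 0 and y_dot = 0."""
    n, m = len(A[0]), len(A)
    gamma, beta, tau = gamma1, L_A / gamma1, 1.0   # gamma_1, beta_1 = L_A / gamma_1, tau_0
    x_bar = [0.0] * n        # overwritten at k = 0 because tau_0 = 1
    y_bar = [0.0] * m
    y_star = [0.0] * m       # y*_0 := y_dot
    for _ in range(num_iters):
        # Step 3: extrapolated dual point
        y_hat = [(1.0 - tau) * yb + tau * ys for yb, ys in zip(y_bar, y_star)]
        # Step 4: closed-form primal subproblem (2.6) for this f and b_X
        g = rmatvec(A, y_hat)
        x_star = [(aj - gj) / (1.0 + gamma) for aj, gj in zip(a, g)]
        # Step 5: dual gradient step with step size gamma_{k+1} / L_A
        r = matvec(A, x_star)
        y_bar = [yh + (gamma / L_A) * (ri - ci) for yh, ri, ci in zip(y_hat, r, c)]
        # Step 6: weighted primal averaging
        x_bar = [(1.0 - tau) * xb + tau * xs for xb, xs in zip(x_bar, x_star)]
        # Step 7: closed-form second dual step with beta_{k+1}
        rb = matvec(A, x_bar)
        y_star = [(ri - ci) / beta for ri, ci in zip(rb, c)]
        # Steps 8-9: parameter updates
        tau = cubic_root(tau)
        gamma = gamma / (1.0 + tau)
        beta = (1.0 - tau) * beta
    return x_bar, y_bar
```

On the toy instance $A = [1\ 2]$, $c = 2$, $a = (1, 0)$ with $L_A = 5$ and $\gamma_1 = \sqrt{5}$, the averaged primal iterate should approach $x^* = (1.2, 0.4)$ with $f^* = 0.1$, in line with the $O(1/k)$ guarantees stated for this method.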


4.3. Convergence analysis. The following theorem shows the convergence of Algorithm 2.
For the constrained setting (1.3), we still have the lower bound on $f(\bar{x}^k) - f^*$ as in Theorem 3.2,
i.e., $-\|y^*\|_{\mathcal{Y}} \|A\bar{x}^k - c\|_{\mathcal{Y},*} \le f(\bar{x}^k) - f^*$ for any $\bar{x}^k \in \mathcal{X}$ and $y^* \in \mathcal{Y}^*$.

Theorem 4.2. Let bx be chosen such that L\, x = 1, and {w k } be the sequence generated by 
Algorithm^ for solving (1.1), where 71 > 0 is given. 

If dom (g) = y, then the following convergence bound holds 


P{x k ) P* < ^b x (x*;x) + ^- by (y*(x k ),y), 


where $y^*(\bar x^k) \in \partial g(A\bar x^k)$.

If $g(\cdot) := \delta_{\{c\}}(\cdot)$, then the following bounds for (1.3) hold:
$$f(\bar x^k) - f^* \ge -\|y^*\|_{\mathcal Y}\|A\bar x^k - c\|_{\mathcal Y,*},$$

f{x k ) - /* < J^b x {x*,x) + ^b y (y* 7 y ) + \\y*\\ y \\Ax k - c||y,*, 


\\Ax k -c\\y^ < “I L by \\y*-y\\y + (\\y*-y\\y +^-b x {x*,x)) 


1/2 


(4.4) 


(4.5) 


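Proof. Since $S_\beta(\bar x) \le G_{\gamma\beta}(\bar w;\dot w) + \gamma b_{\mathcal X}(x^*,\dot x)$, using Lemma 4.1 we can show that $S_{\beta_k}(\bar x^k) \le G_{\gamma_k\beta_k}(\bar w^k;\dot w) + \gamma_k b_{\mathcal X}(x^*,\dot x) \le \gamma_k b_{\mathcal X}(x^*,\dot x)$. Similar to the proof of Theorem 3.2, we obtain the bound (4.4). The bounds in (4.5) are consequences of Lemma 2.1 combined with the corresponding upper bounds on $\beta_k$ and $\gamma_k$. □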


Similar to Theorem 3.2, we can simplify the bound (4.5) to obtain a simpler bound as in (3.4); we omit the details here. The choice of $\gamma_1$ and $\beta_1$ in Theorem 4.2 also trades off the primal objective residual against the primal feasibility gap.


4.4. The choice of the smoother. For this algorithm, one needs to choose a norm $\|\cdot\|_{\mathcal X} = \|\cdot\|_S$ and a smoother $p_{\mathcal X}$ which is strongly convex in the norm $\|\cdot\|_S$. One possibility is to choose $\|\cdot\|_S$ so as to have a simple formula for $\hat x^*_{k+1} = x^*_{\gamma_{k+1}}(\hat y^k;\dot x)$. A classical choice is a diagonal $S$ and $p_{\mathcal X}(\cdot) = \tfrac12\|\cdot - \dot x\|_S^2$ for a given $\dot x \in \mathbb R^n$.

If $f$ is decomposable as $f(x) = \sum_i f_i(x_i)$ and we choose $b_{\mathcal X}(x,\dot x) := \sum_i b_{\mathcal X_i}(x_i,\dot x_i)$, the computation of $\hat x^*_{k+1}$ at Step 4 of Algorithm 2 can be carried out in parallel.

Another possibility is to choose $S = A$ and $p_{\mathcal X} = \tfrac12\|\cdot\|_S^2$. In that case, the computation of $\hat x^*_{k+1}$ may require an iterative sub-solver, but we are allowed to take $\dot x = x^*$. Indeed, as $Ax^* = c$, we have that for all $x$, $b_{\mathcal X}(x,x^*) = \tfrac12\|x - x^*\|_S^2 = \tfrac12(Ax - c)^\top(Ax - c)$. Hence, we can consider $x^*$ as a center even though we do not know it. We shall develop the consequences of such a choice in Section 5.1.
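As an illustration of the decomposable case, the Python sketch below solves the primal subproblem coordinate by coordinate. It assumes the hypothetical instance $f(x) = \lambda\|x\|_1$ with the Euclidean smoother $b_{\mathcal X}(x,\dot x) = \tfrac12\|x - \dot x\|_2^2$, so that each coordinate of $\arg\min_x\{f(x) + \langle z, x\rangle + \gamma b_{\mathcal X}(x,\dot x)\}$, with $z$ standing for $A^\top\hat y^k$, reduces to a soft-thresholding step. The function names are illustrative only.

```python
import math


def soft_threshold(v, thr):
    """Scalar soft-thresholding: sign(v) * max(|v| - thr, 0)."""
    return math.copysign(max(abs(v) - thr, 0.0), v)


def separable_primal_step(z, x_dot, gamma, lam):
    """Minimize sum_i lam*|x_i| + z_i*x_i + (gamma/2)*(x_i - x_dot_i)**2.

    Every coordinate is independent of the others, so the list comprehension
    below could equally be distributed over parallel workers.
    """
    return [soft_threshold(xd - zi / gamma, lam / gamma)
            for zi, xd in zip(z, x_dot)]


x_new = separable_primal_step([1.0, -4.0, 2.0], [0.0, 1.0, -3.0], 2.0, 1.0)
```

Other separable choices of $f_i$ would only change the scalar update inside the comprehension.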
5. Special instances of the primal-dual gap reduction framework. We specialize our ADSGARD scheme to handle two special cases: the augmented Lagrangian method and the strongly convex objective. Then, we provide an extension of our algorithms to a general cone constraint.

5.1. Accelerated smoothing augmented Lagrangian gap reduction method. The augmented Lagrangian (AL) method is a classical optimization technique, and has been widely used in various applications due to its strong practical performance. In this section, we customize Algorithm 2 using ADSGARD to solve the constrained convex problem (1.3). The inexact variant of this algorithm can be found in our early technical report [53, Section 5.3].

The augmented Lagrangian smoother. We choose here px = || • ||^ = || • \\\, Py = IHly = INI/ 
and $\dot x = x^*$ and $b_{\mathcal X}(x,\dot x) := \tfrac12\|A(x - x^*)\|^2_{\mathcal Y,*} = \tfrac12\|Ax - c\|^2_{\mathcal Y,*}$. This is indeed the augmented term for the Lagrange function of (1.3). Note that even though $\dot x$ is unknown, $b_{\mathcal X}(x,\dot x)$ can be computed easily using the equality $Ax^* = c$.

We specify the primal-dual ADSGARD scheme with the augmented Lagrangian smoother for fixed $\gamma_{k+1} = \gamma_0$ as follows:


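$$
\begin{cases}
\hat y^k := (1-\tau_k)\bar y^k + \tau_k y^*_{\beta_k}(A\bar x^k;\dot y),\\
x^*_{\gamma_0}(\hat y^k) := \arg\min_{x\in\mathcal X}\big\{f(x) + \langle\hat y^k, Ax - c\rangle + \tfrac{\gamma_0}{2}\|Ax - c\|^2_{\mathcal Y,*}\big\},\\
\bar y^{k+1} := \hat y^k + \gamma_0\big(Ax^*_{\gamma_0}(\hat y^k) - c\big),\\
\bar x^{k+1} := (1-\tau_k)\bar x^k + \tau_k x^*_{\gamma_0}(\hat y^k),
\end{cases} \tag{ASALGARD}
$$

where $\tau_k \in (0,1)$, $\gamma_0 > 0$ is the penalty (or the primal smoothness) parameter, and $\beta_k$ is the dual smoothness parameter. As a result, this method is called the Accelerated Smoothing Augmented Lagrangian GAp ReDuction (ASALGARD) scheme.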


This scheme consists of two dual steps at lines 1 and 3. However, we can combine these steps as in (4.3) so that it requires only one matrix-vector multiplication $Ax$. Consequently, the complexity-per-iteration of (ASALGARD) remains essentially the same as the standard augmented Lagrangian method [7].

The update rule for parameters. In our augmented Lagrangian method, we only need to update $\tau_k$ and $\beta_k$ such that $\beta_{k+1} \ge (1-\tau_k)\beta_k$ and $\gamma_0\beta_k(1-\tau_k) \ge \tau_k^2$. Using equality in these conditions and defining $\tau_k := t_k^{-1}$, we can derive

$$t_{k+1} := \tfrac12\Big(1 + \sqrt{1 + 4t_k^2}\Big) \quad\text{and}\quad \beta_{k+1} := \frac{t_k - 1}{t_k}\,\beta_k. \tag{5.1}$$

Here, we fix $\beta_1 > 0$ and choose $t_0 := 1$.
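A minimal Python sketch of the parameter rule (5.1) is given below. It assumes $\beta_1 = 1/\gamma_0$, which is the value obtained when the second condition is taken with equality and $\tau_0 = 1$; this starting value is an assumption of the sketch and is not stated explicitly above. The function name is hypothetical.

```python
import math


def asalgard_schedule(gamma0, num_iters):
    """Return [(k, tau_k, beta_k)] for k = 1..num_iters following rule (5.1).

    Assumption: beta_1 = 1 / gamma0 (equality in gamma0*beta_{k+1} = tau_k**2
    with tau_0 = 1).
    """
    t_prev = 1.0                 # t_0 := 1
    beta = 1.0 / gamma0          # beta_1 (assumed)
    schedule = []
    for k in range(1, num_iters + 1):
        t = 0.5 * (1.0 + math.sqrt(1.0 + 4.0 * t_prev * t_prev))  # t_k
        tau = 1.0 / t                                             # tau_k
        schedule.append((k, tau, beta))
        beta = (t - 1.0) / t * beta   # beta_{k+1} = (t_k - 1) / t_k * beta_k
        t_prev = t
    return schedule


schedule = asalgard_schedule(gamma0=2.0, num_iters=50)
```

Under this assumption both conditions hold with equality for every $k \ge 1$, and since $t_k \ge (k+2)/2$ the dual smoothness parameter decays as $\beta_k \le 4/(\gamma_0(k+1)^2)$.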

The algorithm template. We modify Algorithm 2 to obtain the following augmented Lagrangian variant, Algorithm 3.

The main step of Algorithm 3 is the solution of the primal convex subproblem (5.2). In general, solving this subproblem remains challenging due to the non-separability of the quadratic term $\|Ax - c\|^2_{\mathcal Y,*}$. We can solve it numerically by using either alternating direction optimization methods or other first-order methods. The convergence analysis of inexact augmented Lagrangian methods can be found in the literature.

Convergence guarantee. The following proposition shows the convergence of Algorithm 3, whose proof is moved to Appendix A.5.

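Proposition 5.1. Let $\{\bar w^k\}$ be the sequence generated by Algorithm 3. Then, we have

$$
\begin{cases}
-\dfrac{8L_{b_{\mathcal Y}}\|y^*\|_{\mathcal Y}\|y^* - \dot y\|_{\mathcal Y}}{\gamma_0(k+2)^2} \le f(\bar x^k) - f^* \le \dfrac{8L_{b_{\mathcal Y}}\|y^*\|_{\mathcal Y}\|y^* - \dot y\|_{\mathcal Y} + 4b_{\mathcal Y}(y^*,\dot y)}{\gamma_0(k+2)^2},\\[2mm]
\|A\bar x^k - c\|_{\mathcal Y,*} \le \dfrac{8L_{b_{\mathcal Y}}\|y^* - \dot y\|_{\mathcal Y}}{\gamma_0(k+2)^2}.
\end{cases} \tag{5.3}
$$

As a consequence, the worst-case iteration-complexity of Algorithm 3 to achieve an $\varepsilon$-primal solution $\bar x^k$ for (1.3) is $O\Big(\sqrt{\tfrac{b_{\mathcal Y}(y^*,\dot y)}{\gamma_0\varepsilon}}\Big)$.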
Algorithm 3 (Accelerated Smoothing Augmented Lagrangian GAp ReDuction (ASALGARD))

Initialization:

l: Choose an initial value 70 > 0 and Bo := 1. Set to := 1 and /3i := jg , 
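2: Choose an initial point $(\bar x^0,\bar y^0) \in \mathcal X \times \mathcal Y$.

For $k = 0$ to $k_{\max}$, perform:

3: Update $y^*_{\beta_k}(A\bar x^k;\dot y) := \nabla b^*_{\mathcal Y}\big(\beta_k^{-1}(A\bar x^k - c),\dot y\big)$.

4: Update

$$x^*_{\gamma_0}(\hat y^k) := \arg\min_{x\in\mathcal X}\big\{f(x) + \langle\hat y^k, Ax - c\rangle + \tfrac{\gamma_0}{2}\|Ax - c\|^2_{\mathcal Y,*}\big\}. \tag{5.2}$$

5: Update $\bar y^{k+1} := \hat y^k + \gamma_0\big(Ax^*_{\gamma_0}(\hat y^k) - c\big)$ and $\bar x^{k+1} := (1 - t_k^{-1})\bar x^k + t_k^{-1}x^*_{\gamma_0}(\hat y^k)$.

6: Update $t_{k+1} := 0.5\big(1 + \sqrt{1 + 4t_k^2}\big)$ and $\beta_{k+2} := (t_{k+1} - 1)\beta_{k+1}/t_{k+1}$.

End for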
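The following Python sketch runs the four updates of the (ASALGARD) scheme on a toy instance. It is an illustration under several assumptions and not a reference implementation: the problem is $\min\{\tfrac12\|x\|_2^2 : \langle a, x\rangle = c\}$ with a single constraint, the dual smoother is Euclidean so that $y^*_{\beta}(A\bar x;\dot y) = \dot y + (\langle a,\bar x\rangle - c)/\beta$, and the starting value $\beta_0 := \beta_1 := 1/\gamma_0$ is assumed. For this $f$, the subproblem (5.2) has the closed form $x = -s\,a$ with $s = (\hat y - \gamma_0 c)/(1 + \gamma_0\|a\|^2)$.

```python
import math


def asalgard_toy(a, c, gamma0=1.0, y_dot=0.0, num_iters=300):
    """(ASALGARD) sketch for  min 0.5*||x||^2  s.t.  <a, x> = c  (toy problem)."""
    norm_a_sq = sum(ai * ai for ai in a)
    x_bar = [0.0] * len(a)
    y_bar = y_dot
    t = 1.0                  # t_0 := 1, hence tau_0 = 1
    beta = 1.0 / gamma0      # assumption: beta_0 := beta_1 := 1/gamma0
    for k in range(num_iters):
        tau = 1.0 / t
        residual = sum(ai * xi for ai, xi in zip(a, x_bar)) - c
        y_star = y_dot + residual / beta             # y*_{beta_k}(A x_bar; y_dot)
        y_hat = (1.0 - tau) * y_bar + tau * y_star   # first dual step
        # Closed-form solution of subproblem (5.2) for f = 0.5*||x||^2.
        s = (y_hat - gamma0 * c) / (1.0 + gamma0 * norm_a_sq)
        x_star = [-s * ai for ai in a]
        res_star = -s * norm_a_sq - c                # <a, x_star> - c
        y_bar = y_hat + gamma0 * res_star            # second dual step
        x_bar = [(1.0 - tau) * xb + tau * xs for xb, xs in zip(x_bar, x_star)]
        if k >= 1:
            beta = (1.0 - tau) * beta                # beta_{k+1} = (1 - tau_k) beta_k
        t = 0.5 * (1.0 + math.sqrt(1.0 + 4.0 * t * t))
    return x_bar, y_bar


x_final, y_final = asalgard_toy([1.0, 2.0, 2.0], 3.0)
gap = abs(sum(ai * xi for ai, xi in zip([1.0, 2.0, 2.0], x_final)) - 3.0)
objective = 0.5 * sum(xi * xi for xi in x_final)
```

On this instance the optimal value is $c^2/(2\|a\|^2) = 0.5$, and the feasibility residual of the returned point is small. The sketch makes no claim to reproduce the constants of (5.3).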


The estimate (5.3) guides us to choose a large value for $\gamma_0$ such that we obtain better convergence bounds. However, if $\gamma_0$ is too large, then the complexity of solving the subproblem (5.2) increases commensurately. In practice, $\gamma_0$ is often updated using a heuristic strategy. In general settings, since computing the solution $\hat x^*_{k+1}$ of (5.2) requires solving a generic convex problem, it no longer has a closed-form expression.

5.2. The strongly convex objective case. If the objective function $f$ of (1.1) is strongly convex with the convexity parameter $\mu_f > 0$, then it is well-known [44] that its conjugate $f^*$ is smooth, and its gradient $\nabla f^*(\cdot) := x^*(\cdot)$ is Lipschitz continuous with the Lipschitz constant $L_{f^*} := \mu_f^{-1}$, where $x^*(\cdot)$ is given by

$$x^*(u) := \arg\max_{x\in\mathcal X}\{\langle u, x\rangle - f(x)\}. \tag{5.4}$$

In addition, if $f_A^*(\cdot) := f^*(-A^\top(\cdot))$, then $\nabla f_A^*$ is Lipschitz continuous with $L_{f_A^*} := \bar L_A/\mu_f$.
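As a small illustration of (5.4), the Python sketch below evaluates $x^*(u)$ and the gradient of $y \mapsto f^*(-A^\top y)$ for the hypothetical quadratic $f(x) = \tfrac{\mu}{2}\|x\|_2^2 + \langle q, x\rangle$ on $\mathcal X = \mathbb R^n$, for which $x^*(u) = (u - q)/\mu$. The instance and all names are assumptions made for the example.

```python
import math


def x_star(u, mu, q):
    """Maximizer of <u, x> - f(x) for f(x) = (mu/2)*||x||^2 + <q, x>."""
    return [(ui - qi) / mu for ui, qi in zip(u, q)]


def grad_f_A_star(A, y, mu, q):
    """Gradient of y -> f*(-A^T y), which equals -A x*(-A^T y)."""
    n = len(A[0])
    minus_At_y = [-sum(A[i][j] * y[i] for i in range(len(A))) for j in range(n)]
    x = x_star(minus_At_y, mu, q)
    return [-sum(a_ij * x_j for a_ij, x_j in zip(row, x)) for row in A]


def norm(v):
    return math.sqrt(sum(vi * vi for vi in v))


A = [[1.0, 0.0], [0.0, 2.0]]      # spectral norm 2, so ||A||^2 = 4
mu, q = 2.0, [0.0, 0.0]
g1 = grad_f_A_star(A, [1.0, 1.0], mu, q)
g0 = grad_f_A_star(A, [0.0, 0.0], mu, q)
lipschitz_ratio = norm([a - b for a, b in zip(g1, g0)]) / norm([1.0, 1.0])
```

For this diagonal $A$ the observed ratio stays below $\|A\|^2/\mu = 2$, in agreement with the Lipschitz estimate above.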

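The primal-dual update scheme. In this subsection, we only illustrate the modification of ADSGARD to solve the strongly convex primal problem (1.1) as

$$
\begin{cases}
\hat y^k := (1-\tau_k)\bar y^k + \tau_k y^*_{\beta_k}(A\bar x^k;\dot y),\\
\bar x^{k+1} := (1-\tau_k)\bar x^k + \tau_k x^*(-A^\top\hat y^k),\\
\bar y^{k+1} := \mathrm{prox}_{\frac{\mu_f}{\bar L_A}g^*}\Big(\hat y^k + \frac{\mu_f}{\bar L_A}Ax^*(-A^\top\hat y^k)\Big).
\end{cases} \tag{ADSGARD$_\mu$}
$$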


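We note that we no longer have the smoothness parameter $\gamma_k$. Hence, the conditions (4.1) of Lemma 4.1 reduce to $\beta_{k+1} \ge (1-\tau_k)\beta_k$ and $(1-\tau_k)\beta_k \ge L_{f_A^*}\tau_k^2$. From these conditions we can derive the update rule for $\tau_k$ and $\beta_k$ as in Algorithm 3, which is

$$t_{k+1} := \tfrac12\Big(1 + \sqrt{1 + 4t_k^2}\Big), \quad \beta_{k+1} := \frac{t_k - 1}{t_k}\,\beta_k \quad\text{and}\quad \tau_k := \frac{1}{t_k}. \tag{5.5}$$

Here, we fix $\beta_1 := L_{f_A^*} = \bar L_A/\mu_f$ and choose $t_0 := 1$.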

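Convergence guarantee. The following proposition shows the convergence of (ADSGARD$_\mu$), whose proof is in Appendix A.6.

Proposition 5.2. Suppose that the objective $f$ of the constrained convex problem (1.3) is strongly convex with the convexity parameter $\mu_f > 0$. Let $\{\bar w^k\}$ be generated by (ADSGARD$_\mu$) using the update rule (5.5). Then

$$
\begin{cases}
-\dfrac{8L_{b_{\mathcal Y}}\bar L_A\|y^*\|_{\mathcal Y}\|y^* - \dot y\|_{\mathcal Y}}{\mu_f(k+2)^2} \le f(\bar x^k) - f^* \le \dfrac{8L_{b_{\mathcal Y}}\bar L_A\|y^*\|_{\mathcal Y}\|y^* - \dot y\|_{\mathcal Y} + 4b_{\mathcal Y}(y^*;\dot y)}{\mu_f(k+2)^2},\\[2mm]
\|A\bar x^k - c\|_{\mathcal Y,*} \le \dfrac{8L_{b_{\mathcal Y}}\bar L_A\|y^* - \dot y\|_{\mathcal Y}}{\mu_f(k+2)^2}.
\end{cases} \tag{5.6}
$$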
This result shows that (ADSGARD$_\mu$) has the $O(1/k^2)$ convergence rate w.r.t. the objective residual and the feasibility gap. We note that in both Propositions 5.1 and 5.2 the bounds depend only on quantities in the dual space and on $\bar L_A$.

5.3. Extension to general cone constraints. The theory presented in the previous sections can be extended to solve the following general constrained convex optimization problem:

$$f^* := \min_{x\in\mathcal X}\{f(x) : Ax - c \in \mathcal K\}, \tag{5.7}$$

where $f$, $A$ and $c$ are defined as in (1.3), and $\mathcal K$ is a nonempty, closed and convex set in $\mathbb R^m$.

If $\mathcal K$ is bounded, then a simple way to process (5.7) is to use a slack variable $r \in \mathcal K$ such that $r := Ax - c$, and $z := (x, r)$ as a new variable. Then we can transform (5.7) into (1.3) with respect to the new variable $z$. The primal subproblem corresponding to $r$ is defined as $\min\{\langle -y, r\rangle : r \in \mathcal K\}$, which is equivalent to the support function $s_{\mathcal K}(y) := \sup\{\langle y, r\rangle : r \in \mathcal K\}$ of $\mathcal K$. Consequently, the dual function becomes $\tilde g(y) := g(y) - s_{\mathcal K}(y)$, where $g(y) := \min\{f(x) + \langle Ax - c, y\rangle : x \in \mathcal X\}$. Now, we can apply the algorithms presented in the previous sections to obtain an approximate solution $\bar z^k := (\bar x^k,\bar r^k)$ with a convergence guarantee on $f(\bar x^k) - f^*$, $\|A\bar x^k - \bar r^k - c\|$, $\bar x^k \in \mathcal X$ and $\bar r^k \in \mathcal K$ as in Theorem 3.2 or Theorem 4.2.


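If $\mathcal K$ is a cone (e.g., $\mathcal K := \mathbb R^m_+$, $\mathcal K$ is a second-order cone $\mathcal L^m$, or $\mathcal K$ is a semidefinite cone $\mathcal S^m_+$), then with the choice $p_{\mathcal Y}(\cdot) := (1/2)\|\cdot\|^2$, we can substitute the smoothed function $g_\beta$ in (2.7) by the following one:

$$g_\beta(Ax,\dot y) := \max\big\{\langle Ax - c, y\rangle - (\beta/2)\|y - \dot y\|^2 : y \in -\mathcal K^*\big\}, \tag{5.8}$$

where $\mathcal K^*$ is the dual cone of $\mathcal K$, which is defined as $\mathcal K^* := \{z : \langle z, x\rangle \ge 0,\ x \in \mathcal K\}$. With this definition, we use the smoothed gap function $G_{\gamma\beta}$ as $G_{\gamma\beta}(w;\dot w) := P_\beta(x;\dot y) - D_\gamma(y;\dot x)$, where $D_\gamma(y;\dot x) := \min\{f(x) + \langle Ax - c, y\rangle + \gamma b_{\mathcal X}(x,\dot x) : x \in \mathcal X\}$ is the smoothed dual function defined as before, and $P_\beta(x;\dot y) := f(x) + g_\beta(Ax,\dot y)$.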


In principle, we can apply one of the two previous schemes to solve (5.7). Let us demonstrate ADSGARD for this case. Since $\mathcal K$ is a cone, we keep the original scheme (ADSGARD) with the following changes:

$$
\begin{aligned}
y^*_{\beta_k}(A\bar x^k;\dot y) &:= \mathrm{proj}_{-\mathcal K^*}\big(\dot y + \beta_k^{-1}(A\bar x^k - c)\big),\\
\bar y^{k+1} &:= \mathrm{proj}_{-\mathcal K^*}\Big(\hat y^k + \frac{\gamma_{k+1}}{\bar L_A}\big(Ax^*_{\gamma_{k+1}}(\hat y^k) - c\big)\Big),
\end{aligned}
$$

where $\mathrm{proj}_{-\mathcal K^*}$ is the projection onto the cone $-\mathcal K^*$. In this case, we still have the convergence guarantee as in Theorem 4.2 for the objective residual $f(\bar x^k) - f^*$ and the primal feasibility gap $\mathrm{dist}(A\bar x^k - c,\mathcal K)$. We note that if $\mathcal K$ is a self-dual cone, then $\mathcal K^* = \mathcal K$. Hence, $y^*_{\beta_k}(A\bar x^k;\dot y)$ and $\bar y^{k+1}$ can be computed efficiently or even in closed form.
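For the nonnegative orthant $\mathcal K = \mathbb R^m_+$, which is self-dual, the cone $-\mathcal K^*$ is the nonpositive orthant and the projection is a componentwise minimum with zero. The Python sketch below writes out the two modified dual steps in that case. The step size is passed in as a plain number, and the function names are hypothetical.

```python
def project_nonpositive(v):
    """Projection onto -K* for K = R^m_+, i.e. onto the nonpositive orthant."""
    return [min(vi, 0.0) for vi in v]


def smoothed_dual_point(residual, y_dot, beta):
    """y*_beta(A x; y_dot) = proj_{-K*}(y_dot + (A x - c) / beta), residual = A x - c."""
    return project_nonpositive([yd + r / beta for yd, r in zip(y_dot, residual)])


def projected_dual_step(y_hat, residual_star, step):
    """y_bar_next = proj_{-K*}(y_hat + step * (A x_star - c))."""
    return project_nonpositive([yh + step * r for yh, r in zip(y_hat, residual_star)])


y_center = smoothed_dual_point([1.0, -2.0], [0.0, 0.0], 0.5)
y_next = projected_dual_step([-1.0, 0.5], [2.0, -1.0], 0.5)
```

In the first call the constraint component with a nonnegative residual receives a zero multiplier, while the violated component receives a negative one. Other self-dual cones would replace only `project_nonpositive`.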


5.4. Restarting. Similar to other accelerated gradient algorithms in [25] [Ml |5Tj , restarting 
ASGARD and ADSGARD may lead to a faster algorithm in practice. We discuss in this subsection how to restart these two algorithms using a fixed-iteration restarting strategy [46].

If we consider ASGARD, then, when a restart takes place, we perform the following:

$$\hat x^{k+1} \leftarrow \bar x^{k+1}, \quad \dot y \leftarrow y^*_{\beta_{k+1}}(A\hat x^k;\dot y), \quad \beta_{k+1} \leftarrow \beta_1, \quad \tau_{k+1} \leftarrow 1. \tag{5.9}$$

Restarting the primal variable $\bar x^{k+1}$ is classical. For the dual center point $\dot y$, we suggest restarting it at the last dual variable computed. Indeed, by (2.11), we know that the distance between $y^*_{\beta_{k+1}}(A\hat x^k;\dot y)$ and the optimal solution $y^*$ will remain bounded. Hence, in the favorable cases, we will benefit from a smaller distance between the new center point and $y^*$, while in the unfavorable cases, restarting should not affect the convergence too much.

For ADSGARD, we suggest restarting it as follows:


y k+1 


y k+1 , 

y 


y k+ \ 

X 


X 7fc+i 

fdk+l 

<- 

Pi, 

lk+1 

«- 

7i) 

Tfc+1 

4- 

1. 


(5.10) 


Understanding the actual consequences of the restart procedure, as well as designing other conditions for restarting, are still open questions, even for the unconstrained case. Yet we observe that restarting often significantly improves the convergence speed in practice.

6. Numerical experiments. In this section, we provide some key examples to illustrate the advantages of our new algorithms compared to the existing state of the art. While other numerical experiments can be found in our technical reports, we focus here on some extreme cases where existing methods may encounter arbitrarily slow convergence due to a lack of theory, while our methods exhibit the $O(1/k)$ rate predicted by the theory.

6.1. A degenerate linear program. We aim at comparing different algorithms to solve the following simple linear program:

$$
\begin{array}{ll}
\min\limits_{x\in\mathbb R^n} & 2x_n\\
\text{s.t.} & \sum_{k=1}^{n-1}x_k = 1,\\
& x_n - \sum_{k=1}^{n-1}x_k = 0 \quad (2 \le j \le d),\\
& x_n \ge 0.
\end{array} \tag{6.1}
$$

The second constraint is repeated $d - 1$ times, which makes the problem degenerate. Yet, qualification conditions hold since this is a linear program. This fits into our framework with $f(x) := 2x_n + \delta_{\{x_n \ge 0\}}(x_n)$, $Ax := \big[\sum_{k=1}^{n-1}x_k;\ x_n - \sum_{k=1}^{n-1}x_k;\ \dots;\ x_n - \sum_{k=1}^{n-1}x_k\big]$, $c := (1,0,\dots,0)^\top \in \mathbb R^d$ and $g(\cdot) := \delta_{\{c\}}(\cdot)$. A primal and a dual solution can be found explicitly, and by varying the sizes $n$ and $d$ of the problem, one can control the degree of degeneracy.
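The data of (6.1) are easy to assemble. The Python sketch below builds $A$, $c$ and the cost vector as plain lists and checks one explicit primal solution: the constraints force $x_n = 1$, so the optimal value is $2$, and any nonnegative or signed split of the first $n-1$ coordinates summing to one is feasible. The helper name and the particular solution are illustrative choices.

```python
def build_degenerate_lp(n, d):
    """Return (A, c, cost) for problem (6.1) as dense Python lists."""
    first_row = [1.0] * (n - 1) + [0.0]         # sum_{k<n} x_k = 1
    repeated_row = [-1.0] * (n - 1) + [1.0]     # x_n - sum_{k<n} x_k = 0
    A = [first_row] + [list(repeated_row) for _ in range(d - 1)]
    c = [1.0] + [0.0] * (d - 1)
    cost = [0.0] * (n - 1) + [2.0]
    return A, c, cost


n, d = 10, 200
A, c, cost = build_degenerate_lp(n, d)
x_opt = [1.0 / (n - 1)] * (n - 1) + [1.0]       # one explicit primal solution
residual = [sum(a * x for a, x in zip(row, x_opt)) - ci for row, ci in zip(A, c)]
value = sum(w * x for w, x in zip(cost, x_opt))
```

Since rows $2$ to $d$ coincide, $A$ has rank two whatever the value of $d$, which is the source of the degeneracy.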

In this test, we chose $n = 10$ and $d = 200$. We implement both ASGARD and ADSGARD and their restarting variants. In Figure 6.1, we compare our methods against the Chambolle-Pock method [13]. We can see that the Chambolle-Pock method struggles with the degeneracy while ASGARD still exhibits an $O(1/k)$ sublinear rate of convergence.

In Figure 6.2, we compare methods that require solving a nontrivial optimization problem at each iteration, in this case the inversion of a rank-deficient linear system. We thus compare (ASALGARD) with and without restart against ADMM [9]. For ADMM, we selected the step-size parameter by sweeping from small values to large values and choosing the one that gives the fastest performance. Here also, our algorithm is robust to the degeneracy, and restarting improves the performance.

6.2. Generalized convex feasibility problem. Given $N$ nonempty, closed and convex sets $\mathcal X_i \subseteq \mathbb R^n$ for $i = 1,\dots,N$, we consider the following optimization problem:

$$\min_{x := (x_1^\top,\dots,x_N^\top)^\top \in \mathbb R^{Nn}}\Big\{f(x) := \sum_{i=1}^N s_{\mathcal X_i}(x_i) \;:\; \sum_{i=1}^N A_i^\top x_i = 0\Big\}, \tag{6.2}$$

where $s_{\mathcal X_i}$ is the support function of $\mathcal X_i$ and $A_i \in \mathbb R^{n\times m}$ is given for $i = 1,\dots,N$.
Fig. 6.1. Comparison of infeasibility (left) and function value (right) for ASGARD (solid blue line), ASGARD with a restart every 100 iterations using (5.9) (dashed pink line), ADSGARD with a restart every 100 iterations using (5.10) (black dotted line) and Chambolle-Pock (green dash-dotted line). The dashed red line is the theoretical bound of ASGARD (Theorem 3.2). ADSGARD leads to similar results as ASGARD on this linear program (6.1): the difference is not perceptible on the figure.

Fig. 6.2. Comparison of infeasibility (left) and function value (right) for (ASALGARD) (solid blue line), (ASALGARD) with a restart every 100 iterations (dashed red line) and ADMM (black dotted line).


It is straightforward to show that the dual problem of (6.2) is the following generalization of the convex feasibility problem in convex optimization:

$$\text{Find } y^* \in \mathbb R^m \text{ such that: } A_iy^* \in \mathcal X_i \quad (i = 1,\dots,N). \tag{6.3}$$

Clearly, when $A_i = I$, the identity matrix, for all $i$, (6.3) becomes the well-known convex feasibility problem. When $A_i = I$ for some $i \in \{1,\dots,N\}$ and $A_j = A$ otherwise, (6.3) becomes the multiple-set split feasibility problem as considered in the literature. Assume that (6.3) has a solution and $N \ge 2$. Hence, (6.2) and (6.3) satisfy Assumption A.1.


Our aim is to apply Algorithm 1 and Algorithm 2 to solve the primal problem (6.2), and compare them with the state-of-the-art ADMM algorithm with multiple blocks [23]. Clearly, with nonorthogonal $A_i$, the primal subproblem of computing $x_i$ in the parallel-ADMM scheme does not have a closed-form solution, and we typically need to solve it iteratively up to a given accuracy. In addition, by a change of variable, we can rescale the iterates such that ADMM does not depend on the penalty parameter when solving (6.2). With the use of the Euclidean distance for our smoother, Algorithm 1 and Algorithm 2 can solve the primal subproblem (2.6) in $x_i$ in closed form, which only requires the projection onto $\mathcal X_i$.
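The closed form mentioned above follows from the Moreau decomposition $\mathrm{prox}_{\lambda s_{\mathcal X}}(v) = v - \lambda\,\mathrm{proj}_{\mathcal X}(v/\lambda)$. The Python sketch below applies it with $\lambda = 1/\gamma$ to one block, assuming a Euclidean smoother and, as a hypothetical example, a half-space $\mathcal X = \{y : \langle a, y\rangle \le b\}$. Here `z` stands for the linear term acting on the block (for instance $A_i y$), and all names are illustrative.

```python
def project_halfspace(u, a, b):
    """Euclidean projection of u onto the half-space {y : <a, y> <= b}."""
    violation = sum(ai * ui for ai, ui in zip(a, u)) - b
    if violation <= 0.0:
        return list(u)
    scale = violation / sum(ai * ai for ai in a)
    return [ui - scale * ai for ui, ai in zip(u, a)]


def support_prox_step(z, x_dot, gamma, a, b):
    """argmin_x s_X(x) + <z, x> + (gamma/2)*||x - x_dot||^2 for the half-space X.

    With v = x_dot - z/gamma this is prox_{s_X/gamma}(v) = v - proj_X(gamma*v)/gamma.
    """
    v = [xd - zi / gamma for xd, zi in zip(x_dot, z)]
    p = project_halfspace([gamma * vi for vi in v], a, b)
    return [vi - pi / gamma for vi, pi in zip(v, p)]


x_block = support_prox_step([2.0], [3.0], 2.0, [1.0], 1.0)
```

In one dimension with $\mathcal X = (-\infty, 1]$ the support function is $s_{\mathcal X}(x) = x$ for $x \ge 0$ and $+\infty$ otherwise, so the step reduces to $\max(v - 1/\gamma, 0)$, which is what the example evaluates.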

The first experiment is for N = 2. We choose Xi := |y £ M n : ey\ — YTj=i Vj — l} an d 

X 2 ■= {y £ R" : J2i=2 Vj — ~ 1} t° be two half-planes, where e > 0 is fixed. The constant e 
represents the angle between these half-planes. It is well-known [54] that the ADMM algorithm can be written equivalently as an alternating projection method on the dual space. The convergence of this algorithm strongly depends on the angle between these sets. By varying $\varepsilon$, we observe its effect on the convergence speed of ADMM, while our algorithms seem not to depend on $\varepsilon$. Figure 6.3 shows the convergence rate on the absolute feasibility $\|\sum_{i=1}^N x_i\|_2$ of the three algorithms for $n = 10{,}000$.

Since the objective value is always zero, we omit its plot here.
Fig. 6.3. Comparison of Algorithm 1, Algorithm 2 and ADMM with different values of $\varepsilon$ (left). Comparison of Algorithm 1, Algorithm 2 and their restarted variants (restarting after every 100 iterations) (right). The number of variables is $Nn = 20{,}000$.

The theoretical versions of Algorithm 1 and Algorithm 2 exhibit a convergence rate slightly better than $O(1/k)$, independent of $\varepsilon$, while ADMM can be arbitrarily slow as $\varepsilon$ decreases. ADMM very soon drops to a certain accuracy and then saturates at that level before it converges. Algorithm 1 and Algorithm 2 also quickly converge to $10^{-5}$ accuracy and then make slow progress to achieve $10^{-6}$ accuracy. We also notice that the averaging sequence of ADMM converges at the $O(1/k)$ rate, but it remains far away from our theoretical rate in Algorithm 1 and Algorithm 2. If we combine these two algorithms with our restarting strategy, both algorithms converge after 102 iterations. We see that Algorithm 1 performs very similarly to Algorithm 2. We can also observe that the performance of our algorithms depends on $\bar L_A$ and the initial points, but it is relatively independent of the geometric structure of the problem, as opposed to the ADMM scheme for solving (6.2).


Now we extend to the cases $N = 3$ and $N = 4$, where we add two more sets $\mathcal X_3$ and $\mathcal X_4$. We choose $\mathcal X_3 := \{y \in \mathbb R^n : 0.5\varepsilon y_1 - \sum_{j=2}^n y_j = 1\}$ to be a hyperplane in $\mathbb R^n$, and $\mathcal X_4 := \{y \in \mathbb R^n : -y_1 + \sum_{j=3}^n y_j \le 1\}$ to be a half-plane in $\mathbb R^n$. We test our algorithms and the multiblock-ADMM method in [21], and the results are plotted in Figure 6.4 for the case $n = 10{,}000$. In both cases, ADMM still makes slow progress as $\varepsilon$ decreases and $N$ increases. Algorithm 1 and Algorithm 2 seem to depend only slightly on $N$. We note that since $A_i = I$ for $i = 1,\dots,N$, the complexity-per-iteration of the three algorithms in our experiment is essentially the same, and it is not necessary to report it.

7. A comparison between our results and existing methods. We have presented a 
new primal-dual framework and two main algorithms (one with a primal flavor and one with a 
dual flavor) together with two special cases. We now summarize the main differences between our 
approach and existing methods in the literature. 

Problem structure assumptions. Our approach requires convexity and the existence of primal and dual solutions of (1.1). We argue that such assumptions are mild for (1.1) and (1.2) and can be verified a priori. We emphasize that existing primal-dual methods, including decomposition [50], splitting, and alternating direction methods, require other structure assumptions on $f$, such as Lipschitz gradient, strong convexity, or error bound conditions, or the boundedness of both

the primal and dual feasible sets, which may not be satisfied for (1.1) [19, 18, 20. :T3j Wf \ 148 ] . 
Fig. 6.4. Comparison of Algorithm 1, Algorithm 2 and ADMM with different values of $\varepsilon$. The left plot is for $N = 3$ and the right one is for $N = 4$.


Convergence characterization. We characterize the $O(1/k)$-convergence rate of the objective residual $P(\bar x^k) - P^*$ for (1.1) without any smoothness or strong convexity-type assumption on $f$ or $g$. According to [44], this convergence rate is optimal for the class of nonsmooth convex problems
(1.1). Our convergence rate does not depend on the smoothness parameter 7 as in 31 04], but 
still requires the boundedness of $\partial g(\cdot)$. Unfortunately, we do not have the convergence of the iterates.


For the constrained setting (1.3), we can characterize the $O(1/k)$-convergence rate on both the primal objective residual $f(\bar x^k) - f^*$ and the feasibility gap $\|A\bar x^k - c\|$ separately. We note that this rate has often been achieved via a Lipschitz gradient assumption on $f$ in the literature. Otherwise, without additional assumptions, we can obtain such a rate only in the joint primal-dual variables $(\bar x^k,\bar y^k)$ as in [13, 28, 29]. For multiple objective terms, [21] considered different types of parallel-ADMM algorithms, but the convergence characterization at the end is $o(1/\sqrt{k})$ on the feasibility.

Specifically, we develop two different types of algorithms, one with a weighted averaging scheme and one without any averaging at all in the primal space. In addition, our bounds on (1.3) depend only on the distance between the initial points and an optimal solution instead of both the primal and the dual prox-diameters as in [13, 28, 29]; hence they are applicable to problems with unbounded domains.

Decomposition methods. Our basic methods broadly fall under the class of decomposition methods, in contrast to alternating direction methods such as ADMM. We note that the convergence rate guarantees in existing ADMM methods apply to the reformulated problems rather than the original one. Some authors directly developed ADMM for solving (1.1) [30]. However, unless additional assumptions are imposed, ADMM may fail to converge with more than
two blocks as shown in [13 E3- Other authors have attempted to provide conditions for the 
convergence of multiblock ADMM, e.g., [35]. These conditions are relatively generic and do not 
provide a clear recipe on how to choose the parameters properly. One fundamental disadvantage 
of ADMM is that it requires processing the linear operators in each subproblem, which is often nontrivial in many applications, while our methods do not require any augmented term and can avoid this drawback.

Smoothing and smoothness parameter updates. We use differentiable smoothing functions in 
contrast to Nesterov’s smoothing technique in [32]. We propose explicit rules to update the 
smoothness parameters simultaneously. We emphasize that this is one of the key contributions 
of this paper. In Nesterov's smoothing methods [44], the smoothness parameter is set a priori proportional to the ratio between the accuracy and the prox-diameter $D_{\mathcal X}$ of $\mathcal X$. It is clear that this parameter is often very small, which leads to a large Lipschitz constant in the smoothed dual function. Hence, the algorithm with a fixed smoothness parameter often performs poorly. To the best of our knowledge, we propose here the first adaptive primal-dual algorithms for smoothness parameters without sacrificing the $O(1/k)$ rate.

Averaging vs. non-averaging. Most existing methods employ either non-weighted averaging or weighted averaging schemes to guarantee the $O(1/k)$ rate on the primal sequence. While we also provide a weighted averaging scheme, we alternatively derive a method without any averaging in the primal for solving (1.1). The non-averaging schemes are important since averaging may destroy key structures, such as sparsity or low-rankness in sparse optimization. Our weighted averaging scheme puts increasing weight on the later iterates compared to non-weighted averaging schemes. As indicated in the literature, weighted averaging schemes have better performance guarantees than non-weighted ones.

We have attempted to review various primal-dual methods which are most related to our work. It is still worth mentioning other primal-dual methods that are based on augmented Lagrangian
methods such as alternating direction methods (e.g., AMA, ADMM and their variants) |9j 131], 
l32l m \ , Bregman and other splitting methods p] [17l [23l [M] EH [37l [38] , and using variational 
inequality framework [13, 29, 27 . While most of these works have not considered the global 
convergence rate of the proposed algorithms, a few of them characterized the convergence rate in 
unweighted averaging schemes or use a more general variational inequality/monotone inclusion 
framework to study 0> and Hence, the results achieved are distinct from our results. 

Appendix A. The proof of theoretical results. This section provides the full proofs of the lemmas and theorems in the main text.

A.1. Technical results. We first prove the following basic lemma, which will be used to analyze the convergence of our algorithms in the main text.

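Lemma A.1. Let $h$ be a proper, closed and convex function defined on $\mathcal Z$, and let $h^*$ be its Fenchel conjugate. Let $b_{\mathcal Z}$ be a Bregman distance as defined in (2.2) with a weighted norm. We define a smoothed approximation of $h$ as

$$h_\beta(z;\dot z) := \max_{\hat z\in\mathcal Z}\{\langle z,\hat z\rangle - h^*(\hat z) - \beta b_{\mathcal Z}(\hat z,\dot z)\}, \tag{A.1}$$

where $\dot z \in \mathcal Z$ is fixed and $\beta > 0$ is a smoothness parameter. We also denote by $z^*_\beta(z;\dot z)$ the solution of the maximization problem in (A.1). Then, the following facts hold:

(a) We have a relation between the partial derivatives of $(z,\beta) \mapsto h_\beta(z;\dot z)$ as

$$\frac{\partial h_\beta(z;\dot z)}{\partial\beta}(\beta) = -b_{\mathcal Z}(z^*_\beta(z;\dot z),\dot z) = -b_{\mathcal Z}(\nabla h_\beta(z;\dot z),\dot z).$$

(b) For all $z \in \mathcal Z$, $\beta \mapsto h_\beta(z;\dot z)$ is convex, and for $\beta_{k+1},\beta_k > 0$ and $z \in \mathcal Z$, we have

$$
\begin{aligned}
h_{\beta_{k+1}}(z;\dot z) &\le h_{\beta_k}(z;\dot z) - (\beta_k - \beta_{k+1})\frac{\partial h_\beta(z;\dot z)}{\partial\beta}(\beta_{k+1})\\
&= h_{\beta_k}(z;\dot z) + (\beta_k - \beta_{k+1})b_{\mathcal Z}(\nabla h_{\beta_{k+1}}(z;\dot z),\dot z).
\end{aligned} \tag{A.2}
$$

(c) $h_\beta(\cdot;\dot z)$ has a $1/\beta$-Lipschitz gradient in $\|\cdot\|_{\mathcal Z,*}$. Hence, for all $z,\bar z \in \mathcal Z$, we have

$$h_\beta(\bar z;\dot z) \le h_\beta(z;\dot z) + \langle\nabla h_\beta(z;\dot z),\bar z - z\rangle + \frac{1}{2\beta}\|\bar z - z\|^2_{\mathcal Z,*}, \tag{A.3}$$

$$h_\beta(z;\dot z) + \langle\nabla h_\beta(z;\dot z),\bar z - z\rangle \le h_\beta(\bar z;\dot z) - \frac{\beta}{2}\|\nabla h_\beta(\bar z;\dot z) - \nabla h_\beta(z;\dot z)\|^2_{\mathcal Z}. \tag{A.4}$$

(d) Both functions $h$ and $h_\beta$ evaluated at different points $z,\bar z \in \mathcal Z$ satisfy

$$h_\beta(z;\dot z) + \langle\nabla h_\beta(z;\dot z),\bar z - z\rangle \le h(\bar z) - \beta b_{\mathcal Z}(\nabla h_\beta(z;\dot z),\dot z). \tag{A.5}$$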
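(e) If $\|\cdot\|_{\mathcal Z}$ is derived from a scalar product, then, for all $\tau > 0$ and $z,\bar z \in \mathcal Z$, we have

$$(1-\tau)\|\nabla h_\beta(\bar z;\dot z) - \nabla h_\beta(z;\dot z)\|^2_{\mathcal Z} + \tau\|\nabla h_\beta(\bar z;\dot z) - \dot z\|^2_{\mathcal Z} \ge \tau(1-\tau)\|\nabla h_\beta(z;\dot z) - \dot z\|^2_{\mathcal Z}. \tag{A.6}$$

(f) We can control the influence of a change in the center points from $\dot z_1$ to $\dot z_2$ using the following estimate:

$$h_\beta(z;\dot z_2) \le h_\beta(z;\dot z_1) - \frac{\beta}{2}\|z^*_\beta(z;\dot z_1) - z^*_\beta(z;\dot z_2)\|^2_{\mathcal Z} + \beta\big[b_{\mathcal Z}(z^*_\beta(z;\dot z_2),\dot z_1) - b_{\mathcal Z}(z^*_\beta(z;\dot z_2),\dot z_2)\big]. \tag{A.7}$$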


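Proof. We prove items (a) to (f) in turn.

(a) Since $h_\beta(\cdot;\dot z)$ is defined by the maximization of a strongly concave program in (A.1), where the function in the max operator is linear in $\beta$ and convex in $z$, the maximizer $z^*_\beta(z;\dot z)$ is unique. By the classical marginal derivative theorem [35], the function is differentiable with respect to $\beta$ and $z$. In addition, $\nabla_z h_\beta(z;\dot z) = z^*_\beta(z;\dot z)$.

(b) The function $\beta \mapsto h_\beta(z;\dot z)$ is the pointwise maximum of a family of functions that are linear in $\beta$, indexed by $\hat z$. Hence, it is convex. The rest follows by convexity of $h_\beta$ w.r.t. $\beta$ and item (a).

(c) Since $\beta b_{\mathcal Z}(\cdot,\dot z)$ is $\beta$-strongly convex in the weighted norm $\|\cdot\|_{\mathcal Z}$, the gradient of $h_\beta(\cdot;\dot z)$ is $\tfrac1\beta$-Lipschitz continuous in the corresponding dual norm. The inequalities (A.3) and (A.4) are classical for convex functions with Lipschitz gradient [42].

(d) Let us denote here $z^*_\beta := z^*_\beta(z;\dot z)$. Then, we can derive

$$
\begin{aligned}
h_\beta(z;\dot z) + \langle\nabla h_\beta(z;\dot z),\bar z - z\rangle &= \big(\langle z,z^*_\beta\rangle - h^*(z^*_\beta) - \beta b_{\mathcal Z}(z^*_\beta,\dot z)\big) + \langle z^*_\beta,\bar z - z\rangle\\
&= \langle\bar z,z^*_\beta\rangle - h^*(z^*_\beta) - \beta b_{\mathcal Z}(z^*_\beta,\dot z)\\
&\le \max_{u\in\mathcal Z}\{\langle\bar z,u\rangle - h^*(u)\} - \beta b_{\mathcal Z}(z^*_\beta,\dot z)\\
&= h(\bar z) - \beta b_{\mathcal Z}(\nabla h_\beta(z;\dot z),\dot z).
\end{aligned}
$$

(e) The classical equality $\|(1-\tau)a + \tau c\|^2 = (1-\tau)\|a\|^2 + \tau\|c\|^2 - \tau(1-\tau)\|a - c\|^2$ directly implies the result for any norm $\|\cdot\|$ deriving from a scalar product.

(f) Let us denote $z^*_{\beta,1} := z^*_\beta(z;\dot z_1)$ and $z^*_{\beta,2} := z^*_\beta(z;\dot z_2)$. Using the definition of $h_\beta$ in (A.1) and its optimality condition, we can derive

$$
\begin{aligned}
h_\beta(z;\dot z_2) &= \max_{\hat z\in\mathcal Z}\{\langle z,\hat z\rangle - h^*(\hat z) - \beta b_{\mathcal Z}(\hat z,\dot z_2)\} = \langle z,z^*_{\beta,2}\rangle - h^*(z^*_{\beta,2}) - \beta b_{\mathcal Z}(z^*_{\beta,2},\dot z_2)\\
&= \big[\langle z,z^*_{\beta,2}\rangle - h^*(z^*_{\beta,2}) - \beta b_{\mathcal Z}(z^*_{\beta,2},\dot z_1)\big] + \beta b_{\mathcal Z}(z^*_{\beta,2},\dot z_1) - \beta b_{\mathcal Z}(z^*_{\beta,2},\dot z_2)\\
&\le \langle z,z^*_{\beta,1}\rangle - h^*(z^*_{\beta,1}) - \beta b_{\mathcal Z}(z^*_{\beta,1},\dot z_1) - \frac{\beta}{2}\|z^*_{\beta,1} - z^*_{\beta,2}\|^2_{\mathcal Z} + \beta b_{\mathcal Z}(z^*_{\beta,2},\dot z_1) - \beta b_{\mathcal Z}(z^*_{\beta,2},\dot z_2)\\
&= h_\beta(z;\dot z_1) - \frac{\beta}{2}\|z^*_{\beta,1} - z^*_{\beta,2}\|^2_{\mathcal Z} + \beta\big(b_{\mathcal Z}(z^*_{\beta,2},\dot z_1) - b_{\mathcal Z}(z^*_{\beta,2},\dot z_2)\big),
\end{aligned}
$$

which proves (A.7). □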
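As a sanity check of items (a) and (d), the Python sketch below instantiates the smoothed function in one dimension. It assumes $h = |\cdot|$, so that $h^*$ is the indicator of $[-1,1]$, together with the Euclidean distance $b_{\mathcal Z}(\hat z,\dot z) = \tfrac12(\hat z - \dot z)^2$; the maximizer is then a clipped affine map and, for $\dot z = 0$, $h_\beta$ is the Huber function. This instance and the function names are chosen only for illustration.

```python
def z_star(z, beta, center):
    """Maximizer in (A.1) for h = |.|: clip(center + z/beta, -1, 1)."""
    return min(1.0, max(-1.0, center + z / beta))


def h_beta(z, beta, center):
    """Smoothed absolute value h_beta(z; center); h* vanishes on [-1, 1]."""
    u = z_star(z, beta, center)
    return z * u - 0.5 * beta * (u - center) ** 2


def check_A5(z, z_bar, beta, center):
    """Inequality (A.5) at the pair (z, z_bar), with a small rounding tolerance."""
    u = z_star(z, beta, center)           # equals the gradient of h_beta at z
    lhs = h_beta(z, beta, center) + u * (z_bar - z)
    rhs = abs(z_bar) - 0.5 * beta * (u - center) ** 2
    return lhs <= rhs + 1e-12


grid = [-2.0 + 0.25 * i for i in range(17)]
a5_holds = all(check_A5(z, zb, 0.7, 0.3) for z in grid for zb in grid)

# Item (a): central finite difference in beta versus -b_Z(z*_beta, center).
eps = 1e-6
fd = (h_beta(3.0, 1.0 + eps, 0.0) - h_beta(3.0, 1.0 - eps, 0.0)) / (2.0 * eps)
exact = -0.5 * (z_star(3.0, 1.0, 0.0) - 0.0) ** 2
```

The same pattern can be used to test the remaining items numerically for other simple choices of $h$.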


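A.2. The proof of Lemma 2.1: Key bounds for approximate solutions. We consider the smooth objective residual $S_\beta(x) := (f(x) + g_\beta(Ax;\dot y)) - (f(x^*) + g(Ax^*))$. By using the definition of $g_\beta$, we can derive that

$$
\begin{aligned}
g_\beta(Ax;\dot y) &= \max_{y\in\mathcal Y}\{\langle Ax,y\rangle - g^*(y) - \beta b_{\mathcal Y}(y,\dot y)\}\\
&\ge \langle Ax,y^*\rangle - g^*(y^*) - \beta b_{\mathcal Y}(y^*,\dot y)\\
&= \langle Ax - Ax^*,y^*\rangle + \langle Ax^*,y^*\rangle - g^*(y^*) - \beta b_{\mathcal Y}(y^*,\dot y)\\
&= \langle A(x - x^*),y^*\rangle + g(Ax^*) - \beta b_{\mathcal Y}(y^*,\dot y),
\end{aligned} \tag{A.8}
$$

where the last line is the equality case in the Fenchel-Young inequality using the fact that $Ax^* \in \partial g^*(y^*)$. Similarly, we have

$$
\begin{aligned}
f^*_\gamma(-A^\top y;\dot x) &= \max_{x\in\mathcal X}\{\langle -A^\top y,x\rangle - f(x) - \gamma b_{\mathcal X}(x,\dot x)\}\\
&\ge \langle A^\top(y^* - y),x^*\rangle + f^*(-A^\top y^*) - \gamma b_{\mathcal X}(x^*,\dot x).
\end{aligned} \tag{A.9}
$$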


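Combining (A.8), (A.9), the definition (2.10) of $G_{\gamma\beta}(\cdot;\dot w)$, and the strong duality condition (2.3), we can show that

$$
\begin{aligned}
G_{\gamma\beta}(w) &:= P_\beta(x) - D_\gamma(y)\\
&= f(x) + g_\beta(Ax;\dot y) + f^*_\gamma(-A^\top y;\dot x) + g^*(y)\\
&\overset{(2.3)}{=} S_\beta(x) + f^*_\gamma(-A^\top y;\dot x) + g^*(y) - f^*(-A^\top y^*) - g^*(y^*)\\
&\overset{(A.9)}{\ge} S_\beta(x) + \langle A^\top(y^* - y),x^*\rangle + g^*(y) - g^*(y^*) - \gamma b_{\mathcal X}(x^*,\dot x)\\
&\ge S_\beta(x) - \gamma b_{\mathcal X}(x^*,\dot x),
\end{aligned} \tag{A.10}
$$

where the last inequality holds because $g^*$ is convex and $Ax^* \in \partial g^*(y^*)$ due to (2.4). This proves the first inequality of (2.11).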

Since $b_{\mathcal Y}(\cdot;\dot y)$ is $1$-strongly convex w.r.t. the weighted norm, using the optimality condition of the maximization problem in (2.7) at $y := y^*$ and $u := Ax$, we obtain

$$g_\beta(Ax;\dot y) \ge \langle Ax,y^*\rangle - g^*(y^*) - \beta b_{\mathcal Y}(y^*,\dot y) + \frac{\beta}{2}\|y^*_\beta(Ax;\dot y) - y^*\|^2_{\mathcal Y}. \tag{A.11}$$

By (2.4), we have $-A^\top y^* \in \partial f(x^*)$. Using this and the convexity of $f$, we have $f(x) \ge f(x^*) - \langle A(x - x^*),y^*\rangle$. Summing up the last inequality and (A.11), then using the definition of $S_\beta(x)$, we obtain

$$\frac{\beta}{2}\|y^*_\beta(Ax;\dot y) - y^*\|^2_{\mathcal Y} \le \beta b_{\mathcal Y}(y^*,\dot y) + S_\beta(x) + g(Ax^*) + g^*(y^*) - \langle Ax^*,y^*\rangle = \beta b_{\mathcal Y}(y^*,\dot y) + S_\beta(x),$$

which implies the second estimate in (2.11), where the last step is due to the Fenchel-Young equality $g(Ax^*) + g^*(y^*) = \langle Ax^*,y^*\rangle$.

Now, we consider the choice $g(\cdot) := \delta_{\{c\}}(\cdot)$ in the constrained setting (1.3). Under Assumption A.1, any $w^* := (x^*,y^*) \in \mathcal W^*$ is a saddle point of the Lagrange function $\mathcal L(x,y) := f(x) + \langle Ax - c,y\rangle$, i.e., $\mathcal L(x^*,y) \le \mathcal L(x^*,y^*) \le \mathcal L(x,y^*)$ for all $x \in \mathcal X$ and $y \in \mathbb R^m$. The dual function $D$ in (1.2) becomes $D(y) := -f^*(-A^\top y) - c^\top y = \min_x\{f(x) + \langle Ax - c,y\rangle\}$. It leads to $D(y) \le D^* = f^* \le f(x) + \langle y^*,Ax - c\rangle$, and hence

$$f(x) - D(y) \ge f(x) - f^* \ge \langle c - Ax,y^*\rangle \ge -\|y^*\|_{\mathcal Y}\|Ax - c\|_{\mathcal Y,*} \tag{A.12}$$

for all $(x,y) \in \mathcal W$, which proves (2.12).

Finally, we prove (2.13). Indeed, using the definition of g and gp, and Ax* = c, we can write 

f(x) - /( x*) = /( x) + gp(Ax- y) - f(x*) - g(Ax*) - gp(Ax; y) + g(Ax*) 

fOl 


= Sp(x) - gp(Ax ; y) + g(Ax*) S Sp(x) - (A(x - x*), y*) + /3b y (y*, y) 
JaToI 

< G 1 p(w;w) + (c-Ax,y*) + (3b y (y*, y) + ^b x (x*, x). 


We then use the lower bound inequality (2.12) to get

$$ \langle y^\star, c - Ax\rangle \le f(x) - f^\star = S_\beta(x) - g_\beta(Ax;\dot{y}) + g(Ax^\star) = S_\beta(x) - g_\beta(Ax;\dot{y}), \tag{A.13} $$

where $g(Ax^\star) = 0$ due to the feasibility of $x^\star$, i.e., $Ax^\star = c$. Now, it is obvious that

$$ g_\beta(Ax;\dot{y}) := \sup_{y\in\mathcal{Y}}\{\langle Ax - c, y\rangle - \beta b_{\mathcal{Y}}(y,\dot{y})\} \ge \langle Ax - c, y^\star\rangle - \beta b_{\mathcal{Y}}(y^\star,\dot{y}). $$

Hence, combining this estimate and (A.13) we obtain the first inequality in (2.13).

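As $\nabla b_{\mathcal{Y}}(\cdot,\dot{y})$ is $L_{b_{\mathcal{Y}}}$-Lipschitz continuous, $b_{\mathcal{Y}}(\dot{y},\dot{y}) = 0$ and $\nabla b_{\mathcal{Y}}(\dot{y},\dot{y}) = 0$, we have

$$
\begin{aligned}
g_\beta(Ax;\dot{y}) &= \sup_{y\in\mathcal{Y}}\{\langle Ax - c, y\rangle - \beta b_{\mathcal{Y}}(y,\dot{y})\} \ge \sup_{y\in\mathcal{Y}}\Big\{\langle Ax - c, y\rangle - \frac{\beta L_{b_{\mathcal{Y}}}}{2}\|y - \dot{y}\|_{\mathcal{Y}}^2\Big\} \\
&= \frac{1}{2\beta L_{b_{\mathcal{Y}}}}\|Ax - c\|_{\mathcal{Y},*}^2 + \langle \dot{y}, Ax - c\rangle.
\end{aligned}
$$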

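The last equality comes from the formula of the Fenchel conjugate of the squared norm. Combining this inequality with (A.13), we obtain

$$ \langle y^\star, c - Ax\rangle \le S_\beta(x) - \frac{1}{2L_{b_{\mathcal{Y}}}\beta}\|Ax - c\|_{\mathcal{Y},*}^2 - \langle \dot{y}, Ax - c\rangle. $$

Rearranging this expression and using the Cauchy-Schwarz inequality, we obtain

$$ -\|y^\star - \dot{y}\|_{\mathcal{Y}}\|Ax - c\|_{\mathcal{Y},*} \le S_\beta(x) - (2L_{b_{\mathcal{Y}}}\beta)^{-1}\|Ax - c\|_{\mathcal{Y},*}^2, $$

which leads to

$$ \|Ax - c\|_{\mathcal{Y},*}^2 - 2\beta L_{b_{\mathcal{Y}}}\|y^\star - \dot{y}\|_{\mathcal{Y}}\|Ax - c\|_{\mathcal{Y},*} - 2L_{b_{\mathcal{Y}}}\beta S_\beta(x) \le 0. $$

Let $t := \|Ax - c\|_{\mathcal{Y},*}$. The last inequality becomes the quadratic inequality $t^2 - 2\beta L_{b_{\mathcal{Y}}}\|y^\star - \dot{y}\|_{\mathcal{Y}}\,t - 2L_{b_{\mathcal{Y}}}\beta S_\beta(x) \le 0$. Since this inequality in $t$ has a solution, its discriminant is nonnegative. Hence, $\|y^\star - \dot{y}\|_{\mathcal{Y}}^2 + 2L_{b_{\mathcal{Y}}}^{-1}\beta^{-1}S_\beta(x) \ge 0$ and

$$ t := \|Ax - c\|_{\mathcal{Y},*} \le \beta L_{b_{\mathcal{Y}}}\Big[\|y^\star - \dot{y}\|_{\mathcal{Y}} + \big(\|y^\star - \dot{y}\|_{\mathcal{Y}}^2 + 2L_{b_{\mathcal{Y}}}^{-1}\beta^{-1}S_\beta(x)\big)^{1/2}\Big], $$

which is the second estimate of (2.13).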


□ 
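The feasibility estimate in (2.13) is the larger root of a scalar quadratic, which is easy to confirm numerically. The following is a minimal sketch, not part of the paper: the function names `feasibility_bound` and `check_bound` are hypothetical, `dist` stands for $\|y^\star - \dot{y}\|_{\mathcal{Y}}$, `L` for $L_{b_{\mathcal{Y}}}$, and `S` for $S_\beta(x)$.

```python
import math
import random


def feasibility_bound(beta, L, dist, S):
    """Larger root of q(t) = t^2 - 2*beta*L*dist*t - 2*L*beta*S.

    Every t >= 0 with q(t) <= 0 satisfies t <= this value. The discriminant
    condition dist^2 + 2*S/(L*beta) >= 0 must hold for q(t) <= 0 to be solvable.
    """
    disc = dist * dist + 2.0 * S / (L * beta)
    if disc < 0.0:
        raise ValueError("the quadratic inequality has no solution")
    return beta * L * (dist + math.sqrt(disc))


def check_bound(trials=500, seed=1):
    """Verify that the bound is a root of q and that q is positive beyond it."""
    rng = random.Random(seed)
    for _ in range(trials):
        beta = rng.uniform(0.01, 2.0)
        L = rng.uniform(1.0, 5.0)
        dist = rng.uniform(0.0, 3.0)
        S = rng.uniform(-0.4 * L * beta * dist * dist, 2.0)  # keeps disc >= 0
        q = lambda t: t * t - 2.0 * beta * L * dist * t - 2.0 * L * beta * S
        bound = feasibility_bound(beta, L, dist, S)
        if abs(q(bound)) > 1e-8 or q(bound + 0.1) <= 0.0:
            return False
    return True
```

For example, with `beta = L = dist = 1` and `S = 0` the bound equals `2.0`, which is the familiar case where a nonpositive smoothed gap gives $\|Ax - c\|_{\mathcal{Y},*} \le 2\beta L_{b_{\mathcal{Y}}}\|y^\star - \dot{y}\|_{\mathcal{Y}}$.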


A.3. The convergence analysis of the ASGARD method. In this appendix, we provide
the full convergence analysis of the ASGARD algorithm. First, we prove a key inequality to
maintain the optimality gap reduction condition.

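Lemma A.2. Let us denote $S_k := P_{\beta_k}(\bar{x}^k;\dot{y}) - P^\star = f(\bar{x}^k) + g_{\beta_k}(A\bar{x}^k;\dot{y}) - f(x^\star) - g(Ax^\star)$. If $\tau_k \in (0,1]$, then

$$
\begin{aligned}
S_{k+1} + \frac{L_A\tau_k^2}{2\beta_{k+1}}\|\tilde{x}^{k+1} - x^\star\|_{\mathcal{X}}^2 \le{}& (1-\tau_k)S_k + \frac{L_A\tau_k^2}{2\beta_{k+1}}\|\tilde{x}^k - x^\star\|_{\mathcal{X}}^2 \\
&+ \frac{(1-\tau_k)}{2}\big[(\beta_k - \beta_{k+1})L_{b_{\mathcal{Y}}} - \beta_{k+1}\tau_k\big]\|\nabla g_{\beta_{k+1}}(A\bar{x}^k;\dot{y}) - \dot{y}\|_{\mathcal{Y}}^2.
\end{aligned}
\tag{A.14}
$$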


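Proof. Using Lemma A.1 with $h := g$, $h_\beta := g_\beta$, $\mathcal{Z} := \mathcal{Y}$ and $z := Ax$, we can proceed as

$$
\begin{aligned}
f(\bar{x}^{k+1}) + g_{\beta_{k+1}}(A\bar{x}^{k+1};\dot{y}) \le{}& f(\bar{x}^{k+1}) + g_{\beta_{k+1}}(A\hat{x}^k;\dot{y}) + \langle \nabla g_{\beta_{k+1}}(A\hat{x}^k;\dot{y}), A\bar{x}^{k+1} - A\hat{x}^k\rangle \\
&+ \frac{1}{2\beta_{k+1}}\|A\hat{x}^k - A\bar{x}^{k+1}\|_{\mathcal{Y},*}^2 \\
\le{}& f(\bar{x}^{k+1}) + g_{\beta_{k+1}}(A\hat{x}^k;\dot{y}) + \langle A^\top y^*_{\beta_{k+1}}(A\hat{x}^k;\dot{y}), \bar{x}^{k+1} - \hat{x}^k\rangle \\
&+ \frac{L_A}{2\beta_{k+1}}\|\bar{x}^{k+1} - \hat{x}^k\|_{\mathcal{X}}^2 \\
\overset{\text{def. of }\bar{x}^{k+1}}{\le}{}& f(x) + g_{\beta_{k+1}}(A\hat{x}^k;\dot{y}) + \langle A^\top y^*_{\beta_{k+1}}(A\hat{x}^k;\dot{y}), x - \hat{x}^k\rangle \\
&+ \frac{L_A}{2\beta_{k+1}}\|x - \hat{x}^k\|_{\mathcal{X}}^2 - \frac{L_A}{2\beta_{k+1}}\|\bar{x}^{k+1} - x\|_{\mathcal{X}}^2,
\end{aligned}
\tag{A.15}
$$

where the last inequality comes from the definition of $\bar{x}^{k+1}$ by using its optimality condition and the function value at $x \in \mathcal{X}$.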

Our next step is to choose $x := (1-\tau_k)\bar{x}^k + \tau_k x^\star$. In this case, we have

$$
\begin{aligned}
x - \hat{x}^k &= (1-\tau_k)\bar{x}^k + \tau_k x^\star - (1-\tau_k)\bar{x}^k - \tau_k\tilde{x}^k = \tau_k(x^\star - \tilde{x}^k), \\
x - \bar{x}^{k+1} &= (1-\tau_k)\bar{x}^k + \tau_k x^\star - \hat{x}^k - \tau_k(\tilde{x}^{k+1} - \tilde{x}^k) = \tau_k(x^\star - \tilde{x}^{k+1}).
\end{aligned}
$$
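The two identities for the test point $x = (1-\tau_k)\bar{x}^k + \tau_k x^\star$ only use the relations $\hat{x}^k = (1-\tau_k)\bar{x}^k + \tau_k\tilde{x}^k$ and $\bar{x}^{k+1} = \hat{x}^k + \tau_k(\tilde{x}^{k+1} - \tilde{x}^k)$ that appear in the computation above. A minimal sketch that checks them on random vectors follows; the function name `check_identities` and the random data are illustrative assumptions.

```python
import random


def check_identities(dim=4, seed=2):
    """Return the max-norm errors of the two identities on random data."""
    rng = random.Random(seed)

    def vec():
        return [rng.uniform(-1.0, 1.0) for _ in range(dim)]

    tau = rng.uniform(0.05, 1.0)
    x_bar, x_tilde, x_tilde_next, x_star = vec(), vec(), vec(), vec()
    # x_hat = (1 - tau) * x_bar + tau * x_tilde
    x_hat = [(1 - tau) * b + tau * t for b, t in zip(x_bar, x_tilde)]
    # x_bar_next = x_hat + tau * (x_tilde_next - x_tilde)
    x_bar_next = [h + tau * (tn - t)
                  for h, tn, t in zip(x_hat, x_tilde_next, x_tilde)]
    # test point x = (1 - tau) * x_bar + tau * x_star
    x = [(1 - tau) * b + tau * s for b, s in zip(x_bar, x_star)]
    # identity 1: x - x_hat = tau * (x_star - x_tilde)
    err1 = max(abs((xi - hi) - tau * (si - ti))
               for xi, hi, si, ti in zip(x, x_hat, x_star, x_tilde))
    # identity 2: x - x_bar_next = tau * (x_star - x_tilde_next)
    err2 = max(abs((xi - bi) - tau * (si - ti))
               for xi, bi, si, ti in zip(x, x_bar_next, x_star, x_tilde_next))
    return err1, err2
```

Both returned errors should be at the level of floating-point rounding, which is what makes the $\tau_k^2$ factors appear in front of the two squared distances in the next step.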


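Now, we plug these expressions into (A.15) and, using the convexity of $f$, we can derive

$$
\begin{aligned}
f(\bar{x}^{k+1}) + g_{\beta_{k+1}}(A\bar{x}^{k+1};\dot{y}) \le{}& (1-\tau_k)f(\bar{x}^k) + \tau_k f(x^\star) + g_{\beta_{k+1}}(A\hat{x}^k;\dot{y}) \\
&+ \tau_k\langle A^\top y^*_{\beta_{k+1}}(A\hat{x}^k;\dot{y}), x^\star - \hat{x}^k\rangle + (1-\tau_k)\langle A^\top y^*_{\beta_{k+1}}(A\hat{x}^k;\dot{y}), \bar{x}^k - \hat{x}^k\rangle \\
&+ \frac{L_A\tau_k^2}{2\beta_{k+1}}\|\tilde{x}^k - x^\star\|_{\mathcal{X}}^2 - \frac{L_A\tau_k^2}{2\beta_{k+1}}\|\tilde{x}^{k+1} - x^\star\|_{\mathcal{X}}^2 \\
\le{}& (1-\tau_k)f(\bar{x}^k) + \tau_k f(x^\star) + \tau_k g(Ax^\star) - \tau_k\beta_{k+1}b_{\mathcal{Y}}(\nabla g_{\beta_{k+1}}(A\hat{x}^k;\dot{y}),\dot{y}) \\
&+ (1-\tau_k)g_{\beta_{k+1}}(A\bar{x}^k;\dot{y}) - (1-\tau_k)\frac{\beta_{k+1}}{2}\|\nabla g_{\beta_{k+1}}(A\hat{x}^k;\dot{y}) - \nabla g_{\beta_{k+1}}(A\bar{x}^k;\dot{y})\|_{\mathcal{Y}}^2 \\
&+ \frac{L_A\tau_k^2}{2\beta_{k+1}}\|\tilde{x}^k - x^\star\|_{\mathcal{X}}^2 - \frac{L_A\tau_k^2}{2\beta_{k+1}}\|\tilde{x}^{k+1} - x^\star\|_{\mathcal{X}}^2 \\
\le{}& (1-\tau_k)f(\bar{x}^k) + \tau_k f(x^\star) + \tau_k g(Ax^\star) - \frac{\tau_k\beta_{k+1}}{2}\|\nabla g_{\beta_{k+1}}(A\hat{x}^k;\dot{y}) - \dot{y}\|_{\mathcal{Y}}^2 \\
&+ (1-\tau_k)g_{\beta_k}(A\bar{x}^k;\dot{y}) + (1-\tau_k)(\beta_k - \beta_{k+1})b_{\mathcal{Y}}(\nabla g_{\beta_{k+1}}(A\bar{x}^k;\dot{y}),\dot{y}) \\
&- (1-\tau_k)\frac{\beta_{k+1}}{2}\|\nabla g_{\beta_{k+1}}(A\hat{x}^k;\dot{y}) - \nabla g_{\beta_{k+1}}(A\bar{x}^k;\dot{y})\|_{\mathcal{Y}}^2 \\
&+ \frac{L_A\tau_k^2}{2\beta_{k+1}}\|\tilde{x}^k - x^\star\|_{\mathcal{X}}^2 - \frac{L_A\tau_k^2}{2\beta_{k+1}}\|\tilde{x}^{k+1} - x^\star\|_{\mathcal{X}}^2 \\
\le{}& (1-\tau_k)f(\bar{x}^k) + (1-\tau_k)g_{\beta_k}(A\bar{x}^k;\dot{y}) + \tau_k f(x^\star) + \tau_k g(Ax^\star) \\
&+ (\beta_k - \beta_{k+1})(1-\tau_k)b_{\mathcal{Y}}(\nabla g_{\beta_{k+1}}(A\bar{x}^k;\dot{y}),\dot{y}) \\
&- \frac{\beta_{k+1}}{2}\tau_k(1-\tau_k)\|\nabla g_{\beta_{k+1}}(A\bar{x}^k;\dot{y}) - \dot{y}\|_{\mathcal{Y}}^2 \\
&+ \frac{L_A\tau_k^2}{2\beta_{k+1}}\|\tilde{x}^k - x^\star\|_{\mathcal{X}}^2 - \frac{L_A\tau_k^2}{2\beta_{k+1}}\|\tilde{x}^{k+1} - x^\star\|_{\mathcal{X}}^2.
\end{aligned}
$$

Finally, using the $L_{b_{\mathcal{Y}}}$-Lipschitz continuity of $\nabla b_{\mathcal{Y}}$ in the weighted norm $\|\cdot\|_{\mathcal{Y}}$ and the fact that $\nabla b_{\mathcal{Y}}(\dot{y},\dot{y}) = 0$, we obtain (A.14) from the last derivation. □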

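A.3.1. The proof of Lemma 3.1: Small smoothed primal optimality gap. Let us denote $S_k := P_{\beta_k}(\bar{x}^k;\dot{y}) - P^\star = f(\bar{x}^k) + g_{\beta_k}(A\bar{x}^k;\dot{y}) - f(x^\star) - g(Ax^\star)$. Using (A.14) from Lemma A.2 we have

$$
\begin{aligned}
S_{k+1} + \frac{L_A\tau_k^2}{2\beta_{k+1}}\|\tilde{x}^{k+1} - x^\star\|_{\mathcal{X}}^2 \le{}& (1-\tau_k)S_k + \frac{L_A\tau_k^2}{2\beta_{k+1}}\|\tilde{x}^k - x^\star\|_{\mathcal{X}}^2 \\
&+ \frac{(1-\tau_k)}{2}\big[(\beta_k - \beta_{k+1})L_{b_{\mathcal{Y}}} - \beta_{k+1}\tau_k\big]\|\nabla g_{\beta_{k+1}}(A\bar{x}^k;\dot{y}) - \dot{y}\|_{\mathcal{Y}}^2.
\end{aligned}
\tag{A.16}
$$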


In order to remove the last term in this estimate and to get a telescoping sum, we can impose the following conditions:

$$ (\beta_k - \beta_{k+1})L_{b_{\mathcal{Y}}} = \beta_{k+1}\tau_k \quad\text{and}\quad (1-\tau_k)\frac{\beta_{k+1}}{\tau_k^2} = \frac{\beta_k}{\tau_{k-1}^2}. \tag{A.17} $$

By eliminating $\beta_k$ and $\beta_{k+1}$ from these equalities, we obtain $\tau_k^2(1 + \tau_k/L_{b_{\mathcal{Y}}}) = \tau_{k-1}^2(1 - \tau_k)$. Hence, we can compute $\tau_k$ by solving the cubic equation

$$ p_3(\tau) := \tau^3/L_{b_{\mathcal{Y}}} + \tau^2 + \tau_{k-1}^2\tau - \tau_{k-1}^2 = 0. \tag{A.18} $$

At the same time, we also obtain from (A.17) an update rule $\beta_{k+1} := \frac{\beta_k}{1 + \tau_k/L_{b_{\mathcal{Y}}}} < \beta_k$.


Now, we show that (A.18) has a unique positive solution $\tau_k \in (0,1)$ for any $L_{b_{\mathcal{Y}}} \ge 1$ and $\tau_{k-1} \in (0,1]$. We consider the cubic polynomial $p_3(\tau)$ defined by the left-hand side of (A.18). Clearly, for any $\tau > 0$, we have $p_3'(\tau) = 3\tau^2/L_{b_{\mathcal{Y}}} + 2\tau + \tau_{k-1}^2 > 0$. Hence, $p_3(\cdot)$ is monotonically increasing on $(0,+\infty)$. In addition, since $p_3(0) = -\tau_{k-1}^2 < 0$ and $p_3(1) = 1/L_{b_{\mathcal{Y}}} + 1 > 0$, the equation (A.18) has only one positive solution $\tau_k \in (0,1)$.
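Because $p_3$ is increasing on $(0,1)$ with a sign change between the endpoints, the root of (A.18) can be bracketed and found by plain bisection. The sketch below is one possible way to compute it, not the authors' implementation; the function name `solve_tau` and the fixed iteration count are illustrative choices, and `L` stands for $L_{b_{\mathcal{Y}}}$.

```python
def solve_tau(tau_prev, L=1.0, iters=200):
    """Unique root in (0, 1) of p3(t) = t^3/L + t^2 + tau_prev^2*t - tau_prev^2.

    Bisection is valid here because p3(0) = -tau_prev^2 < 0, p3(1) = 1/L + 1 > 0,
    and p3 is monotonically increasing on (0, 1).
    """
    a = tau_prev * tau_prev
    lo, hi = 0.0, 1.0
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if mid ** 3 / L + mid * mid + a * mid - a < 0.0:
            lo = mid   # root lies to the right of mid
        else:
            hi = mid   # root lies to the left of (or at) mid
    return 0.5 * (lo + hi)
```

Starting from `tau_prev = 1.0`, each call yields the next parameter, and the smoothness parameter can then be updated with the rule `beta / (1.0 + tau / L)` derived from (A.17). A closed-form cubic solver or a few Newton steps would work as well; bisection is used here only because its correctness follows directly from the monotonicity argument.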

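Next, we show that $\tau_k \le \frac{2}{k+2}$. Indeed, by (A.18) we have $p_3(\tau) \ge \tau^2 + \tau_{k-1}^2\tau - \tau_{k-1}^2 =: p_2(\tau)$ for $\tau \ge 0$. Since the unique positive root of $p_2(\tau) = 0$ is $\bar{\tau}_k := \frac{\tau_{k-1}}{2}\big(\sqrt{\tau_{k-1}^2 + 4} - \tau_{k-1}\big)$, we have $p_3(\tau) \ge p_2(\tau) \ge p_2(\bar{\tau}_k) = 0$ for $\tau \ge \bar{\tau}_k$. As $p_3(\tau)$ is monotonically increasing on $\mathbb{R}_+$, its positive solution $\tau_k$ must be in $(0,\bar{\tau}_k]$. Hence, we have $\tau_k \le \frac{\tau_{k-1}}{2}\big(\sqrt{\tau_{k-1}^2 + 4} - \tau_{k-1}\big)$. By induction, we can easily show that $\tau_k \le \frac{2}{k+2}$.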

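We show by induction that $\tau_k \ge \frac{1}{k+1}$. First of all, by the choice of $\tau_0$, we have $\tau_0 = 1 \ge \frac{1}{0+1}$. Suppose that $\tau_{k-1} \ge \frac{1}{k}$; we show that $\tau_k \ge \frac{1}{k+1}$. Assume by contradiction that $\tau_k < \frac{1}{k+1}$. Then, using (A.17) we have

$$ \frac{1}{k^2} \le \tau_{k-1}^2 = \tau_k^2\,\frac{1 + \tau_k/L_{b_{\mathcal{Y}}}}{1 - \tau_k} < \frac{1}{(k+1)^2}\cdot\frac{1 + \frac{1}{(k+1)L_{b_{\mathcal{Y}}}}}{1 - \frac{1}{k+1}} = \frac{k + 1 + L_{b_{\mathcal{Y}}}^{-1}}{k(k+1)^2}. $$

This is equivalent to $(k+1)^2 < k(k + 1 + L_{b_{\mathcal{Y}}}^{-1})$, which contradicts our assumption $L_{b_{\mathcal{Y}}} \ge 1$. Hence, if $\tau_{k-1} \ge \frac{1}{k}$, then we have $\tau_k \ge \frac{1}{k+1}$. We conclude that $\frac{1}{k+1} \le \tau_k \le \frac{2}{k+2}$ for $k \ge 0$.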

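By the update rule $\beta_{k+1} := \frac{\beta_k}{1 + \tau_k/L_{b_{\mathcal{Y}}}}$ of $\beta_k$, we can show that

$$ \beta_{k+1} = \frac{\beta_k}{1 + \tau_k/L_{b_{\mathcal{Y}}}} \le \beta_k\,\frac{k+1}{k + 1 + L_{b_{\mathcal{Y}}}^{-1}} \le \beta_1\prod_{l=1}^{k}\frac{l+1}{l + 1 + L_{b_{\mathcal{Y}}}^{-1}} \to 0 \quad\text{as } k \to \infty. $$

Clearly, if $L_{b_{\mathcal{Y}}} = 1$, then $\beta_{k+1} = \frac{\beta_k}{1 + \tau_k} \le \frac{k+1}{k+2}\beta_k \le \frac{2\beta_1}{k+2}$ by induction.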


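Finally, we upper bound the ratio $\tau_k^2/\beta_{k+1}$ using the second equality in (A.17) as

$$ \frac{\tau_k^2}{\beta_{k+1}} = (1-\tau_k)\frac{\tau_{k-1}^2}{\beta_k} = \frac{\tau_0^2}{\beta_1}\prod_{l=1}^{k}(1-\tau_l) \le \frac{\tau_0^2}{\beta_1}\prod_{l=1}^{k}\frac{l}{l+1} = \frac{1}{\beta_1(k+1)}. $$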


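Using these relations in (A.16) we obtain

$$ \frac{\beta_{k+1}}{\tau_k^2}S_{k+1} + \frac{L_A}{2}\|\tilde{x}^{k+1} - x^\star\|_{\mathcal{X}}^2 \le \frac{\beta_k}{\tau_{k-1}^2}S_k + \frac{L_A}{2}\|\tilde{x}^k - x^\star\|_{\mathcal{X}}^2 \le \cdots \le \frac{\beta_1(1-\tau_0)}{\tau_0^2}S_0 + \frac{L_A}{2}\|\tilde{x}^0 - x^\star\|_{\mathcal{X}}^2. $$

Noting that $\tau_0 = 1$ and using the bound on $\tau_k^2/\beta_{k+1}$ together with $S_k := P_{\beta_k}(\bar{x}^k;\dot{y}) - P^\star$, we get (3.1). □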
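The parameter estimates used in this proof can be checked numerically by running the update rules from (A.17) and (A.18). The sketch below is an illustration under stated assumptions: the helper names `tau_root` and `check_schedule` are hypothetical, the starting values $\tau_0 = 1$ and a given $\beta_1 > 0$ follow the proof, and `L` stands for $L_{b_{\mathcal{Y}}} \ge 1$.

```python
def tau_root(tau_prev, L):
    """Root in (0, 1) of t^3/L + t^2 + tau_prev^2*t - tau_prev^2 by bisection."""
    a = tau_prev * tau_prev
    lo, hi = 0.0, 1.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if mid ** 3 / L + mid * mid + a * mid - a < 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def check_schedule(num_iters=200, beta1=1.0, L=1.0, tol=1e-12):
    """Check 1/(k+1) <= tau_k <= 2/(k+2) and tau_k^2/beta_{k+1} <= 1/(beta_1 (k+1))."""
    tau_prev, beta = 1.0, beta1                  # tau_0 and beta_1
    ok = True
    for k in range(1, num_iters + 1):
        tau = tau_root(tau_prev, L)              # tau_k from (A.18)
        beta_next = beta / (1.0 + tau / L)       # beta_{k+1} from (A.17)
        ok &= 1.0 / (k + 1) - tol <= tau <= 2.0 / (k + 2) + tol
        ok &= tau * tau / beta_next <= 1.0 / (beta1 * (k + 1)) + tol
        if L == 1.0:
            # special case L = 1: beta_{k+1} <= 2 * beta_1 / (k + 2)
            ok &= beta_next <= 2.0 * beta1 / (k + 2) + tol
        tau_prev, beta = tau, beta_next
    return ok
```

The lower bound $1/(k+1)$ is approached from above as $k$ grows, so the $O(1/k)$ decay of $\tau_k$ and of $\tau_k^2/\beta_{k+1}$ cannot be improved by this particular parameter rule.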

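A.4. The proof of Lemma 4.1: Gap reduction in ADSGARD. For simplicity of notation, we denote by $f^*_k(y) := f^*_{\gamma_{k+1}}(-A^\top y;\dot{x})$ the function defined by (2.5), $\bar{y}^*_k := y^*_{\beta_k}(A\bar{x}^k;\dot{y})$, $\hat{x}^*_{k+1} := x^*_{\gamma_{k+1}}(\hat{y}^k;\dot{x})$ and $\bar{x}^*_{k+1} := x^*_{\gamma_{k+1}}(\bar{y}^k;\dot{x})$. By (A.3), $\nabla f^*_\gamma$ is Lipschitz continuous with the Lipschitz constant $L_{f^*_\gamma} := \gamma^{-1}$, and thus $\nabla f^*_k$ is Lipschitz continuous with the Lipschitz constant $\gamma_{k+1}^{-1}L_A$.

First, using the optimality condition for problem (2.9), we obtain

$$ f(\bar{x}^k) + \langle A\bar{x}^k, y\rangle - \beta_k b_{\mathcal{Y}}(y,\dot{y}) - g^*(y) \le P_{\beta_k}(\bar{x}^k;\dot{y}) - (\beta_k/2)\|y - \bar{y}^*_k\|_{\mathcal{Y}}^2. \tag{A.19} $$

Second, using the definition of $f^*_\gamma(\cdot;\dot{x})$ in (2.5), we can show that

$$
\begin{aligned}
\langle A\hat{x}^*_{k+1}, y\rangle + f(\hat{x}^*_{k+1}) &= -\gamma_{k+1}b_{\mathcal{X}}(\hat{x}^*_{k+1},\dot{x}) - f^*_k(\hat{y}^k) + \langle A\hat{x}^*_{k+1}, y - \hat{y}^k\rangle \\
&= -\gamma_{k+1}b_{\mathcal{X}}(\hat{x}^*_{k+1},\dot{x}) - f^*_k(\hat{y}^k) - \langle \nabla f^*_k(\hat{y}^k), y - \hat{y}^k\rangle.
\end{aligned}
\tag{A.20}
$$

Third, using (A.2) for $f^*_\gamma$ and the co-coercivity (A.4) of $f^*_\gamma(\cdot;\dot{x})$, we can derive

$$
\begin{aligned}
-D_{\gamma_k}(\bar{y}^k;\dot{x}) ={}& f^*_{\gamma_k}(-A^\top\bar{y}^k;\dot{x}) + g^*(\bar{y}^k) \\
\ge{}& f^*_{\gamma_{k+1}}(-A^\top\bar{y}^k;\dot{x}) + g^*(\bar{y}^k) - (\gamma_k - \gamma_{k+1})b_{\mathcal{X}}(\bar{x}^*_{k+1},\dot{x}) \\
\ge{}& f^*_{\gamma_{k+1}}(-A^\top\hat{y}^k;\dot{x}) + \langle \nabla f^*_{\gamma_{k+1}}(-A^\top\hat{y}^k;\dot{x}), A^\top(\hat{y}^k - \bar{y}^k)\rangle \\
&+ \frac{\gamma_{k+1}}{2}\|\nabla f^*_{\gamma_{k+1}}(-A^\top\bar{y}^k;\dot{x}) - \nabla f^*_{\gamma_{k+1}}(-A^\top\hat{y}^k;\dot{x})\|_{\mathcal{X}}^2 \\
&+ g^*(\bar{y}^k) - (\gamma_k - \gamma_{k+1})b_{\mathcal{X}}(\bar{x}^*_{k+1},\dot{x}) \\
={}& f^*_k(\hat{y}^k) + \langle \nabla f^*_k(\hat{y}^k), \bar{y}^k - \hat{y}^k\rangle + \frac{\gamma_{k+1}}{2}\|\hat{x}^*_{k+1} - \bar{x}^*_{k+1}\|_{\mathcal{X}}^2 \\
&+ g^*(\bar{y}^k) - (\gamma_k - \gamma_{k+1})b_{\mathcal{X}}(\bar{x}^*_{k+1},\dot{x}).
\end{aligned}
\tag{A.21}
$$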

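Then, by the definition of $\bar{y}^{k+1}$, we can write

$$
\begin{aligned}
D_{\gamma_{k+1}}(\bar{y}^{k+1};\dot{x}) &= -g^*(\bar{y}^{k+1}) - f^*_{\gamma_{k+1}}(-A^\top\bar{y}^{k+1};\dot{x}) \\
&\ge -g^*(\bar{y}^{k+1}) - f^*_k(\hat{y}^k) - \langle \nabla f^*_k(\hat{y}^k), \bar{y}^{k+1} - \hat{y}^k\rangle - \frac{L_A}{2\gamma_{k+1}}\|\bar{y}^{k+1} - \hat{y}^k\|_{\mathcal{Y}}^2 \\
&= -\min_{u\in\mathcal{Y}}\Big\{g^*(u) + f^*_k(\hat{y}^k) + \langle \nabla f^*_k(\hat{y}^k), u - \hat{y}^k\rangle + \frac{L_A}{2\gamma_{k+1}}\|u - \hat{y}^k\|_{\mathcal{Y}}^2\Big\}.
\end{aligned}
\tag{A.22}
$$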


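Using these relations, the definition of $\bar{x}^{k+1}$ and the convexity of $f$, we have

$$
\begin{aligned}
P_{\beta_{k+1}}(\bar{x}^{k+1};\dot{y}) ={}& f(\bar{x}^{k+1}) + \max_{y\in\mathcal{Y}}\{\langle A\bar{x}^{k+1}, y\rangle - g^*(y) - \beta_{k+1}b_{\mathcal{Y}}(y,\dot{y})\} \\
\le{}& \max_{y\in\mathcal{Y}}\Big\{(1-\tau_k)\big[f(\bar{x}^k) + \langle A\bar{x}^k, y\rangle - \beta_k b_{\mathcal{Y}}(y,\dot{y}) - g^*(y)\big] \\
&\qquad + \tau_k\big[\langle A\hat{x}^*_{k+1}, y\rangle + f(\hat{x}^*_{k+1}) - g^*(y)\big]\Big\} \\
\overset{(A.19)+(A.20)}{\le}{}& (1-\tau_k)P_{\beta_k}(\bar{x}^k;\dot{y}) - \tau_k\gamma_{k+1}b_{\mathcal{X}}(\hat{x}^*_{k+1},\dot{x}) \\
&- \min_{y\in\mathcal{Y}}\Big\{\tau_k f^*_k(\hat{y}^k) + \tau_k\langle \nabla f^*_k(\hat{y}^k), y - \hat{y}^k\rangle + \frac{(1-\tau_k)\beta_k}{2}\|y - \bar{y}^*_k\|_{\mathcal{Y}}^2 + \tau_k g^*(y)\Big\} \\
\overset{(A.21)}{\le}{}& (1-\tau_k)\big[P_{\beta_k}(\bar{x}^k;\dot{y}) - D_{\gamma_k}(\bar{y}^k;\dot{x})\big] - \tau_k\gamma_{k+1}b_{\mathcal{X}}(\hat{x}^*_{k+1},\dot{x}) \\
&- \frac{(1-\tau_k)\gamma_{k+1}}{2}\|\hat{x}^*_{k+1} - \bar{x}^*_{k+1}\|_{\mathcal{X}}^2 + (1-\tau_k)(\gamma_k - \gamma_{k+1})b_{\mathcal{X}}(\bar{x}^*_{k+1},\dot{x}) \\
&- \min_{y\in\mathcal{Y}}\Big\{f^*_k(\hat{y}^k) + \langle \nabla f^*_k(\hat{y}^k), (1-\tau_k)\bar{y}^k + \tau_k y - \hat{y}^k\rangle \\
&\qquad + \frac{(1-\tau_k)\beta_k}{2}\|y - \bar{y}^*_k\|_{\mathcal{Y}}^2 + g^*\big((1-\tau_k)\bar{y}^k + \tau_k y\big)\Big\}.
\end{aligned}
$$

Let us define an auxiliary term $T_k$ as

$$ T_k := \frac{(1-\tau_k)\gamma_{k+1}}{2}\|\hat{x}^*_{k+1} - \bar{x}^*_{k+1}\|_{\mathcal{X}}^2 - (1-\tau_k)(\gamma_k - \gamma_{k+1})b_{\mathcal{X}}(\bar{x}^*_{k+1},\dot{x}) + \tau_k\gamma_{k+1}b_{\mathcal{X}}(\hat{x}^*_{k+1},\dot{x}), \tag{A.23} $$

and let us consider the change of variable $u := (1-\tau_k)\bar{y}^k + \tau_k y$ for $y \in \mathcal{Y}$. Then $u \in \mathcal{Y}$ and $u - \hat{y}^k = \tau_k(y - \bar{y}^*_k)$, so we have

$$
\begin{aligned}
P_{\beta_{k+1}}(\bar{x}^{k+1};\dot{y}) \le{}& (1-\tau_k)G_{\gamma_k\beta_k}(\bar{w}^k;\dot{w}) - T_k \\
&- \min_{u\in\mathcal{Y}}\Big\{f^*_k(\hat{y}^k) + \langle \nabla f^*_k(\hat{y}^k), u - \hat{y}^k\rangle + g^*(u) + \frac{(1-\tau_k)\beta_k}{2\tau_k^2}\|u - \hat{y}^k\|_{\mathcal{Y}}^2\Big\} \\
\overset{(A.22)+(4.1)}{\le}{}& (1-\tau_k)G_{\gamma_k\beta_k}(\bar{w}^k;\dot{w}) + D_{\gamma_{k+1}}(\bar{y}^{k+1};\dot{x}) - T_k.
\end{aligned}
\tag{A.24}
$$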


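Finally, we estimate $T_k$ in (A.23) using the strong convexity of $b_{\mathcal{X}}(\cdot,\dot{x})$ as follows:

$$
\begin{aligned}
2T_k \ge{}& (1-\tau_k)\gamma_{k+1}\|\hat{x}^*_{k+1} - \bar{x}^*_{k+1}\|_{\mathcal{X}}^2 + \tau_k\gamma_{k+1}\|\hat{x}^*_{k+1} - \dot{x}\|_{\mathcal{X}}^2 \\
&- (1-\tau_k)(\gamma_k - \gamma_{k+1})L_{b_{\mathcal{X}}}\|\bar{x}^*_{k+1} - \dot{x}\|_{\mathcal{X}}^2 \\
\ge{}& (1-\tau_k)\big[\tau_k\gamma_{k+1} - (\gamma_k - \gamma_{k+1})L_{b_{\mathcal{X}}}\big]\|\bar{x}^*_{k+1} - \dot{x}\|_{\mathcal{X}}^2 \overset{(4.1)}{\ge} 0.
\end{aligned}
\tag{A.25}
$$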


Substituting (A.25) into (A.24), we get $G_{\gamma_{k+1}\beta_{k+1}}(\bar{w}^{k+1};\dot{w}) \le (1-\tau_k)G_{\gamma_k\beta_k}(\bar{w}^k;\dot{w})$.
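The second inequality in (A.25) rests on an elementary three-point estimate: for $\tau \in [0,1]$, $(1-\tau)\|a - b\|^2 + \tau\|a - c\|^2 \ge \tau(1-\tau)\|b - c\|^2$, applied with $a = \hat{x}^*_{k+1}$, $b = \bar{x}^*_{k+1}$ and $c = \dot{x}$. The sketch below checks it in the Euclidean case, which is an assumption made for illustration (the paper works with a weighted norm); the function name `three_point_gap` is hypothetical.

```python
import random


def three_point_gap(a, b, c, tau):
    """(1-tau)*||a-b||^2 + tau*||a-c||^2 - tau*(1-tau)*||b-c||^2, Euclidean norm.

    The value equals ||a - (1-tau)*b - tau*c||^2 and is therefore nonnegative.
    """
    def sq(u, v):
        return sum((ui - vi) ** 2 for ui, vi in zip(u, v))

    return (1 - tau) * sq(a, b) + tau * sq(a, c) - tau * (1 - tau) * sq(b, c)


def check_three_point(trials=1000, dim=5, seed=3):
    rng = random.Random(seed)
    for _ in range(trials):
        a, b, c = ([rng.uniform(-2.0, 2.0) for _ in range(dim)] for _ in range(3))
        tau = rng.random()
        gap = three_point_gap(a, b, c, tau)
        # closed form of the gap: squared distance from a to the convex combination
        closed = sum((ai - (1 - tau) * bi - tau * ci) ** 2
                     for ai, bi, ci in zip(a, b, c))
        if gap < -1e-12 or abs(gap - closed) > 1e-9:
            return False
    return True
```

The exact identity behind the estimate shows that nothing is lost except the squared distance from $a$ to the convex combination of $b$ and $c$.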

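Note that this is valid for all $k \ge 1$. Using similar ideas together with the relations $\bar{x}^1 = \hat{x}^*_1$ and $\hat{y}^0 = \bar{y}^*_0$, we also get

$$ G_{\gamma_1\beta_1}(\bar{w}^1;\dot{w}) \le -\gamma_1 b_{\mathcal{X}}(\bar{x}^1,\dot{x}) + \frac{L_A}{2\gamma_1}\|\bar{y}^1 - \hat{y}^0\|_{\mathcal{Y}}^2 - \beta_1 b_{\mathcal{Y}}(\bar{y}^1,\dot{y}). $$

As $\beta_1\gamma_1 \ge L_A$ and $\bar{y}^*_0 := \dot{y}$, we get $G_{\gamma_1\beta_1}(\bar{w}^1;\dot{w}) \le 0$.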


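Next, we set equality in the three conditions of (4.1) to get $\gamma_{k+1} = \gamma_k(1 + \tau_k/L_{b_{\mathcal{X}}})^{-1}$, $\beta_{k+1} = (1-\tau_k)\beta_k$ and $(1-\tau_k)\beta_k\gamma_{k+1} = \tau_k^2 L_A$. In particular, $\gamma_{k+1}\beta_{k+1} = \tau_k^2 L_A$ and thus $\gamma_1\beta_1 = L_A$. By eliminating $\gamma_k$ and $\beta_k$, we obtain $\tau_k^3/L_{b_{\mathcal{X}}} + \tau_k^2 + \tau_{k-1}^2\tau_k - \tau_{k-1}^2 = 0$. Hence, similar to the proof of Lemma 3.1, we can show that $\tau_k \in (0,1)$ is the unique positive solution of the cubic equation $p_3(\tau) := \tau^3/L_{b_{\mathcal{X}}} + \tau^2 + \tau_{k-1}^2\tau - \tau_{k-1}^2 = 0$. In addition, $\frac{1}{k+1} \le \tau_k \le \frac{2}{k+2}$ for $k \ge 1$ and $\tau_0 = 1$.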

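If $L_{b_{\mathcal{X}}} = 1$, then $\gamma_{k+1} = \frac{\gamma_k}{1 + \tau_k} \le \frac{k+1}{k+2}\gamma_k \le \frac{2\gamma_1}{k+2}$. Similarly, $\beta_{k+1} = (1-\tau_k)\beta_k \le \frac{k}{k+1}\beta_k \le \frac{\beta_1}{k+1}$.

Finally, we note that $\beta_{k+1} = \frac{\tau_k^2 L_A}{\gamma_{k+1}} \ge \frac{L_A(k+2)}{2\gamma_1(k+1)^2} \ge \frac{L_A}{2\gamma_1(k+1)}$. □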
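The parameter rules obtained by setting equality in (4.1) can be simulated to confirm the invariant $\gamma_{k+1}\beta_{k+1} = \tau_k^2 L_A$ and the decay of both smoothness parameters. The following is a minimal sketch, not the authors' code: the names `cubic_root` and `adsgard_schedule` are hypothetical, and the initialization $\tau_0 = 1$, $\gamma_1\beta_1 = L_A$ is taken from the relations stated in the proof.

```python
def cubic_root(tau_prev, L):
    """Root in (0, 1) of t^3/L + t^2 + tau_prev^2*t - tau_prev^2 by bisection."""
    a = tau_prev * tau_prev
    lo, hi = 0.0, 1.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if mid ** 3 / L + mid * mid + a * mid - a < 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def adsgard_schedule(num_iters, gamma1, L_A=1.0, L_bX=1.0):
    """Return rows (k, tau_k, gamma_{k+1}, beta_{k+1}) for k = 1..num_iters."""
    tau_prev = 1.0                          # tau_0
    gamma, beta = gamma1, L_A / gamma1      # gamma_1 and beta_1 with gamma_1*beta_1 = L_A
    rows = []
    for k in range(1, num_iters + 1):
        tau = cubic_root(tau_prev, L_bX)
        gamma = gamma / (1.0 + tau / L_bX)  # gamma_{k+1}
        beta = (1.0 - tau) * beta           # beta_{k+1}
        rows.append((k, tau, gamma, beta))
        tau_prev = tau
    return rows
```

For instance, `adsgard_schedule(50, gamma1=2.0, L_A=3.0)` yields rows whose products `gamma * beta` match `tau**2 * 3.0` up to rounding, so only one of the two smoothness parameters needs to be stored in practice.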

A.5. The proof of Proposition 5.1: The accelerated augmented Lagrangian method.

First of all, with the choice of norms associated with the Lagrangian smoother, we have

$$ L_A := \|A\|^2 = \max_{x\in\mathcal{X}}\frac{\|Ax\|_{\mathcal{Y},*}^2}{\|x\|_{\mathcal{X}}^2} = \max_{x\in\mathcal{X}}\frac{\|Ax\|_{\mathcal{Y},*}^2}{\|Ax\|_{\mathcal{Y},*}^2} = 1. $$

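Secondly, note that the conclusions of Lemma 4.1 are valid for any semi-norm. In particular, if we choose $\beta_1\gamma_0 \ge L_A = 1$,

$$ \gamma_{k+1} = \gamma_0 \ge \frac{\gamma_0}{1 + \tau_k/L_{b_{\mathcal{X}}}}, \qquad \beta_{k+1} = (1-\tau_k)\beta_k, \qquad\text{and}\qquad \frac{1}{\gamma_0} = \frac{(1-\tau_k)\beta_k}{\tau_k^2}, $$

then $G_{\gamma_0\beta_{k+1}}(\bar{w}^{k+1},\dot{w}) \le (1-\tau_k)G_{\gamma_0\beta_k}(\bar{w}^k,\dot{w}) \le 0$.

Eliminating $\beta_{k+1}$ and $\beta_k$ in the equalities, we get

$$ \frac{\tau_{k+1}^2}{1 - \tau_{k+1}} = \tau_k^2. $$

One can easily check by induction that $\beta_k = \beta_1\prod_{l=1}^{k-1}(1-\tau_l) = \beta_1\frac{\tau_{k-1}^2}{\tau_0^2} = \frac{\tau_{k-1}^2}{\gamma_0}$ and $\tau_k \le \frac{2}{k+2}$. We then conclude using Lemma 2.1 and the fact that $b_{\mathcal{X}}(x^\star,\dot{x}) = 0$: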

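$$ S_{\beta_k}(\bar{x}^k) \le G_{\gamma_0\beta_k}(\bar{w}^k,\dot{w}) + 0 \le 0, $$

$$ \|A\bar{x}^k - c\|_{\mathcal{Y},*} \le \beta_k L_{b_{\mathcal{Y}}}\Big[\|y^\star - \dot{y}\|_{\mathcal{Y}} + \big(\|y^\star - \dot{y}\|_{\mathcal{Y}}^2 + 2L_{b_{\mathcal{Y}}}^{-1}\beta_k^{-1}S_{\beta_k}(\bar{x}^k)\big)^{1/2}\Big] \le \frac{8L_{b_{\mathcal{Y}}}\|y^\star - \dot{y}\|_{\mathcal{Y}}}{\gamma_0(k+2)^2}, $$

and

$$
\begin{aligned}
f(\bar{x}^k) - f^\star &\le S_{\beta_k}(\bar{x}^k) - \langle y^\star, A\bar{x}^k - c\rangle + \beta_k b_{\mathcal{Y}}(y^\star;\dot{y}) \\
&\le \|y^\star\|_{\mathcal{Y}}\|A\bar{x}^k - c\|_{\mathcal{Y},*} + \beta_k b_{\mathcal{Y}}(y^\star;\dot{y}) \le \frac{8L_{b_{\mathcal{Y}}}\|y^\star\|_{\mathcal{Y}}\|y^\star - \dot{y}\|_{\mathcal{Y}} + 4b_{\mathcal{Y}}(y^\star;\dot{y})}{\gamma_0(k+2)^2}, \\
f(\bar{x}^k) - f^\star &\ge -\|y^\star\|_{\mathcal{Y}}\|A\bar{x}^k - c\|_{\mathcal{Y},*} \ge -\frac{8L_{b_{\mathcal{Y}}}\|y^\star\|_{\mathcal{Y}}\|y^\star - \dot{y}\|_{\mathcal{Y}}}{\gamma_0(k+2)^2}.
\end{aligned}
$$

The proposition is proved. □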
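In the augmented Lagrangian case the parameter $\tau_k$ has a closed form, since the recursion $\tau_{k+1}^2/(1-\tau_{k+1}) = \tau_k^2$ is a quadratic in $\tau_{k+1}$ with positive root $\tau_{k+1} = \frac{\tau_k}{2}\big(\sqrt{\tau_k^2 + 4} - \tau_k\big)$. The sketch below generates the resulting schedule; the function name `al_schedule` is hypothetical, and the initialization $\tau_0 = 1$, $\beta_1 = 1/\gamma_0$ (equality in $\beta_1\gamma_0 \ge 1$) is an assumption made for illustration.

```python
import math


def al_schedule(num_iters, gamma0):
    """Return rows (k, tau_k, beta_{k+1}) for k = 1..num_iters.

    tau_k solves tau_k^2 / (1 - tau_k) = tau_{k-1}^2 and beta_{k+1} = (1 - tau_k) * beta_k.
    """
    tau = 1.0              # tau_0
    beta = 1.0 / gamma0    # beta_1, assuming beta_1 * gamma_0 = 1
    rows = []
    for k in range(1, num_iters + 1):
        tau = 0.5 * tau * (math.sqrt(tau * tau + 4.0) - tau)   # tau_k
        beta = (1.0 - tau) * beta                              # beta_{k+1}
        rows.append((k, tau, beta))
    return rows
```

With this initialization each row satisfies `beta == tau**2 / gamma0` up to rounding, so the $O(1/k^2)$ decay of the smoothness parameter follows directly from $\tau_k \le 2/(k+2)$.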


A.6. The proof of Proposition 5.2: The strongly convex objective case. The proof follows the line of Lemma 4.1. We only need to replace the Lipschitz continuity coefficient $L_{f^*_k} = \frac{L_A}{\gamma_{k+1}}$ by $L_{f^*} = \frac{L_A}{\mu_f}$ in (A.22) and replace all other occurrences of $\gamma_{k+1}$ by 0. Under a choice of parameters satisfying (5.5), we obtain the gap reduction condition $G_{0,\beta_{k+1}}(\bar{w}^{k+1};\dot{w}) \le (1-\tau_k)G_{0,\beta_k}(\bar{w}^k;\dot{w}) \le 0$ as in Lemma 4.1. We can also check by induction that $\beta_k \le \frac{4L_A}{\mu_f(k+2)^2}$. We obtain the conclusion of Proposition 5.2 by using Lemma 2.1. □